# Metrics

You can enable metrics on StrongDM nodes (gateways, relays, or proxy workers) to assist with monitoring and observability. When visualized on monitoring dashboards and mapped to alerts, metrics provide insight into the status of nodes, including connection failures, disconnects, and availability. Monitoring nodes helps you detect and understand problems as soon as they arise.

This guide defines node metrics, describes common terminology related to such metrics, and provides a configuration example for enabling Prometheus-formatted metrics on a node.

After configuration is complete, you can request metrics from the node on the specified port. For example, if the node listens on `127.0.0.1:9999`, the `/metrics` endpoint can be reached at:

```http
http://127.0.0.1:9999/metrics
```

### Terminology

Common terminology related to node metrics is described in the following table.

| Term   | Description                                                                                                                                                                                                                                                                       |
| ------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Chunk  | A data blob representing a portion of a long-running SSH, RDP, or Kubernetes interactive session recording.                                                                                                                                                                       |
| Egress | The act of a node making an outbound network connection (called an egress connection) directly to a target resource outside the StrongDM relay network. Of the many relay hops that may make up a route from client to resource, only the last hop creates the egress connection. |
| Link   | A secure network connection between a node and a client, relay, or other node. There is generally only one link between any two entities. A link serves as a tunnel through which streams can flow.                                                                               |
| Query  | A single client request to a resource, such as a SQL query. Long-running SSH, RDP, or Kubernetes interactive sessions count as queries.                                                                                                                                           |
| Stream | A single logical network connection between a client and a resource. One stream can be tunneled through multiple links across multiple nodes. One link can contain multiple streams. There can be multiple simultaneous streams between a client and a resource.                  |

### Metrics

Node metrics are described in the following table.

| Metric name                                     | Metric type | Description                                                                                   | Label(s)                                                                                                                                                                                                                                         |
| ----------------------------------------------- | ----------- | --------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| go\_gc\_duration\_seconds                       | Summary     | Summary of the pause duration of garbage collection cycles                                    |                                                                                                                                                                                                                                                  |
| go\_goroutines                                  | Gauge       | Number of goroutines that currently exist                                                     |                                                                                                                                                                                                                                                  |
| go\_info                                        | Gauge       | Information about the Go environment                                                          |                                                                                                                                                                                                                                                  |
| go\_memstats\_alloc\_bytes                      | Gauge       | Number of bytes allocated and still in use                                                    |                                                                                                                                                                                                                                                  |
| go\_memstats\_alloc\_bytes\_total               | Counter     | Total number of bytes allocated even if freed                                                 |                                                                                                                                                                                                                                                  |
| go\_memstats\_buck\_hash\_sys\_bytes            | Gauge       | Number of bytes used by the profiling bucket hash table                                       |                                                                                                                                                                                                                                                  |
| go\_memstats\_frees\_total                      | Counter     | Total number of frees                                                                         |                                                                                                                                                                                                                                                  |
| go\_memstats\_gc\_sys\_bytes                    | Gauge       | Number of bytes used for garbage collection system metadata                                   |                                                                                                                                                                                                                                                  |
| go\_memstats\_heap\_alloc\_bytes                | Gauge       | Number of heap bytes allocated and still in use                                               |                                                                                                                                                                                                                                                  |
| go\_memstats\_heap\_idle\_bytes                 | Gauge       | Number of heap bytes waiting to be used                                                       |                                                                                                                                                                                                                                                  |
| go\_memstats\_heap\_inuse\_bytes                | Gauge       | Number of heap bytes that are in use                                                          |                                                                                                                                                                                                                                                  |
| go\_memstats\_heap\_objects                     | Gauge       | Number of allocated objects                                                                   |                                                                                                                                                                                                                                                  |
| go\_memstats\_heap\_released\_bytes             | Gauge       | Number of heap bytes released to OS                                                           |                                                                                                                                                                                                                                                  |
| go\_memstats\_heap\_sys\_bytes                  | Gauge       | Number of heap bytes obtained from the system                                                 |                                                                                                                                                                                                                                                  |
| go\_memstats\_last\_gc\_time\_seconds           | Gauge       | Number of seconds since 00:00:00 UTC on January 1, 1970 of the last garbage collection        |                                                                                                                                                                                                                                                  |
| go\_memstats\_lookups\_total                    | Counter     | Total number of pointer lookups                                                               |                                                                                                                                                                                                                                                  |
| go\_memstats\_mallocs\_total                    | Counter     | Total number of mallocs                                                                       |                                                                                                                                                                                                                                                  |
| go\_memstats\_mcache\_inuse\_bytes              | Gauge       | Number of bytes in use by mcache structures                                                   |                                                                                                                                                                                                                                                  |
| go\_memstats\_mcache\_sys\_bytes                | Gauge       | Number of bytes used for mcache structures obtained from the system                           |                                                                                                                                                                                                                                                  |
| go\_memstats\_mspan\_inuse\_bytes               | Gauge       | Number of bytes in use by mspan structures                                                    |                                                                                                                                                                                                                                                  |
| go\_memstats\_mspan\_sys\_bytes                 | Gauge       | Number of bytes used for mspan structures obtained from the system                            |                                                                                                                                                                                                                                                  |
| go\_memstats\_next\_gc\_bytes                   | Gauge       | Number of heap bytes when next garbage collection will take place                             |                                                                                                                                                                                                                                                  |
| go\_memstats\_other\_sys\_bytes                 | Gauge       | Number of bytes used for other system allocations                                             |                                                                                                                                                                                                                                                  |
| go\_memstats\_stack\_inuse\_bytes               | Gauge       | Number of bytes in use by the stack allocator                                                 |                                                                                                                                                                                                                                                  |
| go\_memstats\_stack\_sys\_bytes                 | Gauge       | Number of bytes obtained from the system for the stack allocator                              |                                                                                                                                                                                                                                                  |
| go\_memstats\_sys\_bytes                        | Gauge       | Number of bytes obtained from the system                                                      |                                                                                                                                                                                                                                                  |
| go\_threads                                     | Gauge       | Number of OS threads created                                                                  |                                                                                                                                                                                                                                                  |
| promhttp\_metric\_handler\_requests\_in\_flight | Gauge       | Current number of scrapes being served                                                        |                                                                                                                                                                                                                                                  |
| promhttp\_metric\_handler\_requests\_total      | Counter     | Total number of scrapes by HTTP status code                                                   |                                                                                                                                                                                                                                                  |
| sdmcli\_chunk\_completed\_count                 | Counter     | Number of chunks processed by the node                                                        | `type=<RESOURCE_TYPE>` (example: `type=postgres`)                                                                                                                                                                                                |
| sdmcli\_credential\_load\_count                 | Counter     | Total number of times the node has attempted to load credentials for a resource               | <p><code>type=\<RESOURCE\_TYPE></code> (example: <code>type=postgres</code>),<br><code>source=store</code></p>                                                                                                                                   |
| sdmcli\_egress\_count                           | Gauge       | Current number of active egress connections                                                   | `type=<RESOURCE_TYPE>` (example: `type=postgres`)                                                                                                                                                                                                |
| sdmcli\_egress\_attempt                         | Counter     | Total number of times the node has attempted to establish an egress connection to a resource  | <p><code>type=\<RESOURCE\_TYPE></code> (example: <code>type=postgres</code>),<br><code>successful=true</code></p>                                                                                                                                |
| sdmcli\_link\_attempt\_count                    | Counter     | Total number of attempts to establish links with other nodes and listeners                    | `direction=inbound`                                                                                                                                                                                                                              |
| sdmcli\_link\_count                             | Gauge       | Current number of active links                                                                |                                                                                                                                                                                                                                                  |
| sdmcli\_link\_latency                           | Gauge       | Round-trip network latency (in seconds) to a certain node                                     | <p><code>peer\_id=\<UUID\_OF\_GATEWAY></code>,<br><code>peer\_addr=\<HOST:PORT\_OF\_GATEWAY></code></p>                                                                                                                                          |
| sdmcli\_node\_heartbeat\_duration               | Histogram   | Count and duration of each time the node attempts to send a heartbeat to the StrongDM backend |                                                                                                                                                                                                                                                  |
| sdmcli\_node\_heartbeat\_error\_count           | Counter     | Total number of times a heartbeat attempt has failed                                          | `error=invalid operation\|permission denied\|item already exists\|item does not exist\|internal error\|canceled\|deadline exceeded\|unauthenticated\|failed precondition\|aborted\|out of range\|unimplemented\|unavailable\|resource exhausted` |
| sdmcli\_node\_lifecycle\_state\_change\_count   | Counter     | Total number of times the node has changed its lifecycle state                                | `state=verifying_restart\|awaiting_restart\|restarting\|started\|stopped`                                                                                                                                                                        |
| sdmcli\_query\_completed\_count                 | Counter     | Number of queries processed by the node                                                       | `type=<RESOURCE_TYPE>` (example: `type=postgres`)                                                                                                                                                                                                |
| sdmcli\_stream\_count                           | Gauge       | Current number of active streams                                                              |                                                                                                                                                                                                                                                  |
| sdmcli\_upload\_backlog\_bytes                  | Gauge       | Current size of the node's upload backlog in bytes                                            | `type=query_batch\|chunk`                                                                                                                                                                                                                        |
| sdmcli\_upload\_bytes                           | Counter     | Number of bytes the node has attempted to upload                                              | `type=query_batch`                                                                                                                                                                                                                               |
| sdmcli\_upload\_count                           | Counter     | Number of query batches and chunks the node has attempted to upload                           | `type=query_batch`                                                                                                                                                                                                                               |
| sdmcli\_upload\_dropped\_count                  | Counter     | Number of uploads the node has given up retrying                                              | `type=query_batch\|chunk`                                                                                                                                                                                                                        |
| sdmcli\_upload\_retried\_count                  | Counter     | Number of uploads the node has retried                                                        | `type=query_batch\|chunk`                                                                                                                                                                                                                        |
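
Counters in the table above, such as `sdmcli_query_completed_count`, only increase while the node process runs and start again from zero when it restarts, so they are usually read as a rate over time rather than as a raw value. The following is a minimal Python sketch of that idea, not StrongDM or Prometheus code: the function name and the sample values are hypothetical, and the Prometheus `rate()` function does more careful extrapolation than this.

```python
def counter_rate(prev_value, prev_time, curr_value, curr_time):
    """Per-second increase of a counter between two scrapes.

    If the current value is lower than the previous one, the counter is
    assumed to have reset to zero (for example, after a node restart), so
    the whole current value is counted as the increase.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("scrapes must be in increasing time order")
    if curr_value >= prev_value:
        increase = curr_value - prev_value
    else:
        increase = curr_value  # counter reset between the two scrapes
    return increase / elapsed


# Hypothetical samples of sdmcli_query_completed_count taken 60 seconds apart.
print(counter_rate(1200, 0, 1500, 60))   # 5.0 queries per second
print(counter_rate(1500, 60, 30, 120))   # 0.5 (reset detected)
```

Gauges such as `sdmcli_link_count` need no such treatment, because each scrape already reports the current value.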

### Prerequisites

Before you begin configuration, ensure that you have the following:

* StrongDM client version 34.96.0 or higher
* A StrongDM account with the Administrator permission level
* A StrongDM node up and running
* Existing accounts and familiarity with the following:
  * A monitoring system and time series database, such as Prometheus
  * A monitoring dashboard, such as Grafana
  * An alerting tool, such as Prometheus Alertmanager or Rapid7

### Configuration Example

You can use the `/metrics` endpoint to provide metrics to any monitoring solution. This example shows how to enable Prometheus-formatted metrics on a node. It is provided as an example only; the steps for your environment may differ.

Configuration involves these general steps:

* Enable Prometheus-formatted metrics on your node
* Configure Prometheus
* Set up a monitoring dashboard
* Set up alerts

#### 1. Enable Prometheus-formatted metrics on your node

This section explains the various ways to enable Prometheus-formatted metrics on your node. Specify the port, and optionally the IP address, for the node to listen on, either by setting an environment variable or by passing a command-line argument.

Once metrics are enabled, the node starts listening on the specified port.

**Enable metrics using environment variable with port**

Set the `SDM_METRICS_LISTEN_ADDRESS` environment variable in the node's environment. The following example uses port 9999:

```shell
SDM_METRICS_LISTEN_ADDRESS=:9999
```

**Enable metrics using environment variable with IP and port**

To specify an IP address to listen on, set the variable with the IP address and port 9999, as in the following example:

```shell
SDM_METRICS_LISTEN_ADDRESS=127.0.0.1:9999
```

**Enable metrics using CLI setting**

The following example shows how to pass the metrics setting as a command-line argument:

```shell
sdm relay --prometheus-metrics=:9999
```

#### 2. Configure Prometheus

1. Open your Prometheus configuration YAML file for editing.
2. In the `scrape_configs` section, add a job for each node, as in the following example:

   ```yaml
   scrape_configs:
     - job_name: "StrongDM Relay 01"
       static_configs:
         - targets: ["<RELAY_BOX_URL>:9999"]
   ```
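
If you run many nodes, writing one job per node by hand can become repetitive. The following is a rough Python sketch of how the same `scrape_configs` section could be generated from a list of nodes. The function name and the hostnames are hypothetical placeholders, the port defaults to the 9999 used throughout this guide, and job names are assumed not to contain double quotes.

```python
def scrape_config(nodes, port=9999):
    """Render a scrape_configs section with one job per (job_name, host) pair."""
    lines = ["scrape_configs:"]
    for job_name, host in nodes:
        lines.append(f'  - job_name: "{job_name}"')
        lines.append("    static_configs:")
        lines.append(f'      - targets: ["{host}:{port}"]')
    return "\n".join(lines)


# Hypothetical node names and hosts; replace them with your own.
nodes = [
    ("StrongDM Relay 01", "relay-01.example.internal"),
    ("StrongDM Relay 02", "relay-02.example.internal"),
]
print(scrape_config(nodes))
```

The printed text can be pasted into the Prometheus configuration file in place of a hand-written `scrape_configs` section.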

#### 3. Set up your monitoring dashboard

Configure a monitoring dashboard such as Grafana to visualize your Prometheus metrics. For information on creating a Prometheus data source in Grafana, please see the [Prometheus documentation](https://prometheus.io/docs/visualization/grafana/).

#### 4. Set up alerts

Configure your desired alerts in a tool such as Prometheus Alertmanager or Rapid7 to help ensure reliability and stay aware of node performance issues.

For example, you may wish to set alerts for node health, resource health and reachability, a new node failing to connect, and a connected node disconnecting.
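
To illustrate the kind of conditions such alerts might encode, here is a minimal Python sketch that compares two consecutive scrapes of a node. It is an assumption-laden illustration rather than a recommended rule set: the function name, the dictionary layout (metric name mapped to a value summed across labels), and the backlog threshold are all hypothetical. In practice, you would normally express these conditions as rules in your alerting tool.

```python
def check_node(prev, curr, backlog_limit_bytes=50_000_000):
    """Return alert messages derived from two consecutive scrapes.

    prev and curr map metric names to values (hypothetical layout).
    backlog_limit_bytes is an illustrative threshold, not a recommendation.
    """
    alerts = []
    # Gauge: a node with no active links may be disconnected.
    if curr.get("sdmcli_link_count", 0) == 0:
        alerts.append("node has no active links")
    # Counter: any increase means at least one heartbeat failed.
    prev_errors = prev.get("sdmcli_node_heartbeat_error_count", 0)
    if curr.get("sdmcli_node_heartbeat_error_count", 0) > prev_errors:
        alerts.append("heartbeat errors increased")
    # Gauge: a large backlog suggests uploads are not keeping up.
    if curr.get("sdmcli_upload_backlog_bytes", 0) > backlog_limit_bytes:
        alerts.append("upload backlog above limit")
    return alerts


# Hypothetical values from two scrapes of the same node.
prev = {"sdmcli_link_count": 2, "sdmcli_node_heartbeat_error_count": 4,
        "sdmcli_upload_backlog_bytes": 1024}
curr = {"sdmcli_link_count": 0, "sdmcli_node_heartbeat_error_count": 6,
        "sdmcli_upload_backlog_bytes": 2048}
print(check_node(prev, curr))
```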

### How to Request Metrics

After configuration is complete, you can request metrics from the node on the specified port by accessing the `/metrics` endpoint.

For example:

```bash
curl http://127.0.0.1:9999/metrics
```
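
The response is plain text in the Prometheus exposition format: one sample per line, with optional labels in braces and comment lines starting with `#`. As a rough illustration of how that output can be consumed outside of Prometheus, the following Python sketch parses such text into tuples. The function name and the sample values are hypothetical, and the parser is deliberately simplified (for example, it leaves escape sequences in label values untouched).

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp.
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+\S+)?$'
)
# One label pair inside the braces: key="value".
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        name, label_text, value = match.groups()
        labels = dict(LABEL_RE.findall(label_text or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical excerpt of a /metrics response.
SAMPLE = """\
# TYPE sdmcli_link_count gauge
sdmcli_link_count 3
sdmcli_egress_count{type="postgres"} 2
sdmcli_upload_backlog_bytes{type="query_batch"} 0
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

To parse a live response instead of the sample, the text could be fetched with `urllib.request.urlopen("http://127.0.0.1:9999/metrics").read().decode()`, assuming the node listens on that address.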
