»Telemetry

The Vault server process collects various runtime metrics about the performance of different libraries and subsystems. These metrics are aggregated on a ten-second interval and are retained for one minute.

To view the raw data, you must send a signal to the Vault process: on Unix-style operating systems, this is USR1 while on Windows it is BREAK. When the Vault process receives this signal it will dump the current telemetry information to the process's stderr.
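As a minimal sketch of the Unix case, the signal can be sent from Python using only the standard library. The function names and the `pid` argument are illustrative and not part of Vault. The Windows BREAK case is deliberately left out, because `os.kill` treats console control events differently there.

```python
import os
import signal
import sys


def telemetry_signal_name(platform: str = sys.platform) -> str:
    """Name of the signal that asks Vault to dump telemetry on a platform."""
    # Per the text above: BREAK on Windows, USR1 on Unix-style systems.
    return "SIGBREAK" if platform.startswith("win") else "SIGUSR1"


def request_telemetry_dump(pid: int) -> None:
    """Send USR1 to the process with the given PID (Unix-style systems only)."""
    if telemetry_signal_name() != "SIGUSR1":
        raise NotImplementedError("this sketch only covers Unix-style systems")
    # The dump goes to the Vault process's stderr; nothing is returned here.
    os.kill(pid, signal.SIGUSR1)
```

Calling `request_telemetry_dump(pid)` with the PID of a running Vault server should make the dump appear wherever that server's stderr is directed.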

This telemetry information can be used for debugging or otherwise getting a better view of what Vault is doing.

Telemetry information can also be streamed directly from Vault to a range of metrics aggregation solutions, as described in the telemetry stanza documentation.

The following is an example telemetry dump snippet:

[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.expire.num_leases': 5100.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.num_goroutines': 39.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.sys_bytes': 222746880.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.malloc_count': 109189192.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.free_count': 108408240.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.heap_objects': 780953.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.total_gc_runs': 232.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.alloc_bytes': 72954392.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.total_gc_pause_ns': 150293024.000
[2017-12-19 20:37:50 +0000 UTC][S] 'vault.merkle.flushDirty': Count: 100 Min: 0.008 Mean: 0.027 Max: 0.183 Stddev: 0.024 Sum: 2.681 LastUpdated: 2017-12-19 20:37:59.848733035 +0000 UTC m=+10463.692105920
[2017-12-19 20:37:50 +0000 UTC][S] 'vault.merkle.saveCheckpoint': Count: 4 Min: 0.021 Mean: 0.054 Max: 0.110 Stddev: 0.039 Sum: 0.217 LastUpdated: 2017-12-19 20:37:57.048458148 +0000 UTC m=+10460.891835029
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.alloc_bytes': 73326136.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.sys_bytes': 222746880.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.malloc_count': 109195904.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.free_count': 108409568.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.heap_objects': 786342.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.total_gc_pause_ns': 150293024.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.expire.num_leases': 5100.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.num_goroutines': 39.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.total_gc_runs': 232.000
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.route.rollback.consul-': Count: 1 Sum: 0.013 LastUpdated: 2017-12-19 20:38:01.968471579 +0000 UTC m=+10465.811842067
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.rollback.attempt.consul-': Count: 1 Sum: 0.073 LastUpdated: 2017-12-19 20:38:01.968502743 +0000 UTC m=+10465.811873131
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.rollback.attempt.pki-': Count: 1 Sum: 0.070 LastUpdated: 2017-12-19 20:38:01.96867005 +0000 UTC m=+10465.812041936
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.route.rollback.auth-app-id-': Count: 1 Sum: 0.012 LastUpdated: 2017-12-19 20:38:01.969146401 +0000 UTC m=+10465.812516689
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.rollback.attempt.identity-': Count: 1 Sum: 0.063 LastUpdated: 2017-12-19 20:38:01.968029888 +0000 UTC m=+10465.811400276
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.rollback.attempt.database-': Count: 1 Sum: 0.066 LastUpdated: 2017-12-19 20:38:01.969394215 +0000 UTC m=+10465.812764603
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.barrier.get': Count: 16 Min: 0.010 Mean: 0.015 Max: 0.031 Stddev: 0.005 Sum: 0.237 LastUpdated: 2017-12-19 20:38:01.983268118 +0000 UTC m=+10465.826637008
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.merkle.flushDirty': Count: 100 Min: 0.006 Mean: 0.024 Max: 0.098 Stddev: 0.019 Sum: 2.386 LastUpdated: 2017-12-19 20:38:09.848158309 +0000 UTC m=+10473.691527099

You'll note that log entries are prefixed with the metric type as follows:

  • [C] is a counter
  • [G] is a gauge
  • [S] is a summary
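The dump format can be picked apart with a small parser. The following is a rough sketch based only on the gauge and summary lines in the snippet above; it assumes counter lines share the `Key: number` layout of summary lines, and the names `parse_dump_line`, `METRIC_TYPES`, `LINE_RE`, and `STAT_RE` are illustrative.

```python
import re

# Long names for the one-letter type prefixes listed above.
METRIC_TYPES = {"C": "counter", "G": "gauge", "S": "summary"}

# [interval timestamp][type] 'metric name': remainder of the line
LINE_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\[(?P<kind>[CGS])\] '(?P<name>[^']+)': (?P<rest>.*)$"
)
STAT_RE = re.compile(r"(\w+): (-?\d+(?:\.\d+)?)")


def parse_dump_line(line: str):
    """Parse one telemetry dump line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    rest = match.group("rest")
    if match.group("kind") == "G":
        # Gauges carry a single number.
        values = {"value": float(rest)}
    else:
        # Keep only the "Key: number" pairs that precede the LastUpdated timestamp.
        stats = rest.partition(" LastUpdated:")[0]
        values = {key.lower(): float(num) for key, num in STAT_RE.findall(stats)}
    return {
        "interval": match.group("timestamp"),
        "type": METRIC_TYPES[match.group("kind")],
        "name": match.group("name"),
        "values": values,
    }
```

Feeding each line of a captured stderr dump through `parse_dump_line` yields one dictionary per metric, with unrelated log lines returned as `None`.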

The following sections describe the available Vault metrics. When metrics output is triggered manually with the signals described above, the metrics interval can be assumed to be 10 seconds.

»Audit Metrics

These metrics relate to auditing.

Metric | Description | Unit | Type
vault.audit.log_request | Duration of time taken by all audit log requests across all audit log devices | ms | summary
vault.audit.log_response | Duration of time taken by audit log responses across all audit log devices | ms | summary
vault.audit.log_request_failure | Number of audit log request failures. NOTE: This is a particularly important metric. Any non-zero value here indicates that there was a failure to make an audit log request to any of the configured audit log devices; when Vault cannot log to any of the configured audit log devices it ceases all user operations, and you should begin troubleshooting the audit log devices immediately if this metric continually increases. | failures | counter
vault.audit.log_response_failure | Number of audit log response failures. NOTE: This is a particularly important metric. Any non-zero value here indicates that there was a failure to receive a response to a request made to one of the configured audit log devices; when Vault cannot log to any of the configured audit log devices it ceases all user operations, and you should begin troubleshooting the audit log devices immediately if this metric continually increases. | failures | counter

NOTE: In addition, there are audit metrics for each enabled audit device, represented as vault.audit.<type>.log_request and vault.audit.<type>.log_response. For example, if a file audit device is enabled, its metrics would be vault.audit.file.log_request and vault.audit.file.log_response.
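The advice to act when an audit failure metric "continually increases" can be turned into a simple check. This is only a sketch: it assumes you already collect successive readings of a cumulative failure count from your monitoring system, and both the function name and the three-reading window are arbitrary choices.

```python
from typing import Sequence


def audit_failures_increasing(readings: Sequence[float], window: int = 3) -> bool:
    """True if each of the last `window` readings exceeds the one before it."""
    if window < 2 or len(readings) < window:
        return False  # not enough data to call it a trend
    recent = readings[-window:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

Because any non-zero value already indicates a failure, a check like this is better suited to escalation than to first detection.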

»Core Metrics

These metrics represent operational aspects of the running Vault instance.

Metric | Description | Unit | Type
vault.barrier.delete | Duration of time taken by DELETE operations at the barrier | ms | summary
vault.barrier.get | Duration of time taken by GET operations at the barrier | ms | summary
vault.barrier.put | Duration of time taken by PUT operations at the barrier | ms | summary
vault.barrier.list | Duration of time taken by LIST operations at the barrier | ms | summary
vault.core.check_token | Duration of time taken by token checks handled by Vault core | ms | summary
vault.core.fetch_acl_and_token | Duration of time taken by ACL and corresponding token entry fetches handled by Vault core | ms | summary
vault.core.handle_request | Duration of time taken by requests handled by Vault core | ms | summary
vault.core.handle_login_request | Duration of time taken by login requests handled by Vault core | ms | summary
vault.core.leadership_setup_failed | Duration of time taken by cluster leadership setup failures which have occurred in a highly available Vault cluster. This should be monitored and alerted on for overall cluster leadership status. | ms | summary
vault.core.leadership_lost | Duration of time taken by cluster leadership losses which have occurred in a highly available Vault cluster. This should be monitored and alerted on for overall cluster leadership status. | ms | summary
vault.core.post_unseal | Duration of time taken by post-unseal operations handled by Vault core | ms | gauge
vault.core.pre_seal | Duration of time taken by pre-seal operations | ms | gauge
vault.core.seal-with-request | Duration of time taken by requested seal operations | ms | gauge
vault.core.seal | Duration of time taken by seal operations | ms | gauge
vault.core.seal-internal | Duration of time taken by internal seal operations | ms | gauge
vault.core.step_down | Duration of time taken by cluster leadership step downs. This should be monitored and alerted on for overall cluster leadership status. | ms | summary
vault.core.unseal | Duration of time taken by unseal operations | ms | summary

»Runtime Metrics

These metrics represent runtime aspects of the running Vault instance.

Metric | Description | Unit | Type
vault.runtime.alloc_bytes | Number of bytes allocated by the Vault process. This could burst from time to time, but should return to a steady state value. | bytes | gauge
vault.runtime.free_count | Number of freed objects | objects | gauge
vault.runtime.heap_objects | Number of objects on the heap. This is a good general memory pressure indicator worth establishing a baseline and thresholds for alerting. | objects | gauge
vault.runtime.malloc_count | Cumulative count of allocated heap objects | objects | gauge
vault.runtime.num_goroutines | Number of goroutines. This serves as a general system load indicator worth establishing a baseline and thresholds for alerting. | goroutines | gauge
vault.runtime.sys_bytes | Number of bytes allocated to Vault. This includes what is being used by Vault's heap and what has been reclaimed but not given back to the operating system. | bytes | gauge
vault.runtime.total_gc_pause_ns | The total garbage collector pause time since Vault was last started | ns | gauge
vault.runtime.gc_pause_ns | Total duration of the last garbage collection run | ns | sample
vault.runtime.total_gc_runs | Total number of garbage collection runs since Vault was last started | operations | gauge
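Some useful figures can be derived from these gauges. The sketch below assumes the gauges mirror the Go runtime's memory statistics, where live objects equal cumulative allocations minus frees; the function and key names are illustrative, and the sample values are the 20:37:50 readings from the dump snippet earlier on this page.

```python
def derived_runtime_metrics(gauges: dict) -> dict:
    """Derive live object count and mean GC pause from raw runtime gauges."""
    runs = gauges["total_gc_runs"]
    return {
        # Objects still allocated: everything ever allocated minus everything freed.
        "live_objects": gauges["malloc_count"] - gauges["free_count"],
        # Mean pause per collection; guard against division by zero before the first run.
        "avg_gc_pause_ns": gauges["total_gc_pause_ns"] / runs if runs else 0.0,
    }


sample = {
    "malloc_count": 109189192.0,
    "free_count": 108408240.0,
    "total_gc_pause_ns": 150293024.0,
    "total_gc_runs": 232.0,
}
derived = derived_runtime_metrics(sample)
```

For this sample the derived live object count is 780952, within one object of the heap_objects gauge (780953) reported in the same interval, and the mean GC pause works out to roughly 0.65 ms.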

»Policy and Token Metrics

These metrics relate to policies and tokens.

Metric | Description | Unit | Type
vault.expire.fetch-lease-times | Time taken to fetch lease times | ms | summary
vault.expire.fetch-lease-times-by-token | Time taken to fetch lease times by token | ms | summary
vault.expire.num_leases | Number of all leases which are eligible for eventual expiry | leases | gauge
vault.expire.revoke | Time taken to revoke a token | ms | summary
vault.expire.revoke-force | Time taken to forcibly revoke a token | ms | summary
vault.expire.revoke-prefix | Time taken to revoke tokens on a prefix | ms | summary
vault.expire.revoke-by-token | Time taken to revoke all secrets issued with a given token | ms | summary
vault.expire.renew | Time taken to renew a lease | ms | summary
vault.expire.renew-token | Time taken to renew a token which does not need to invoke a logical backend | ms | summary
vault.expire.register | Time taken for register operations. These operations take a request and response with an associated lease and register a lease entry with lease ID | ms | summary
vault.expire.register-auth | Time taken for register authentication operations which create lease entries without lease ID | ms | summary
vault.policy.get_policy | Time taken to get a policy | ms | summary
vault.policy.list_policies | Time taken to list policies | ms | summary
vault.policy.delete_policy | Time taken to delete a policy | ms | summary
vault.policy.set_policy | Time taken to set a policy | ms | summary
vault.token.create | The time taken to create a token | ms | summary
vault.token.create_root | Number of created root tokens. Does not decrease on revocation. | token | counter
vault.token.createAccessor | The time taken to create a token accessor | ms | summary
vault.token.lookup | The time taken to look up a token | ms | summary
vault.token.revoke | Time taken to revoke a token | ms | summary
vault.token.revoke-tree | Time taken to revoke a token tree | ms | summary
vault.token.store | Time taken to store an updated token entry without writing to the secondary index | ms | summary

»Auth Methods Metrics

These metrics relate to supported authentication methods.

Metric | Description | Unit | Type
vault.rollback.attempt.auth-token | Time taken to perform a rollback operation for the token auth method | ms | summary
vault.rollback.attempt.auth-ldap | Time taken to perform a rollback operation for the LDAP auth method | ms | summary
vault.rollback.attempt.cubbyhole | Time taken to perform a rollback operation for the Cubbyhole secret backend | ms | summary
vault.rollback.attempt.secret | Time taken to perform a rollback operation for the K/V secret backend | ms | summary
vault.rollback.attempt.sys | Time taken to perform a rollback operation for the system backend | ms | summary
vault.route.rollback.auth-ldap | Time taken to perform a route rollback operation for the LDAP auth method | ms | summary
vault.route.rollback.auth-token | Time taken to perform a route rollback operation for the token auth method | ms | summary
vault.route.rollback.cubbyhole | Time taken to perform a route rollback operation for the Cubbyhole secret backend | ms | summary
vault.route.rollback.secret | Time taken to perform a route rollback operation for the K/V secret backend | ms | summary
vault.route.rollback.sys | Time taken to perform a route rollback operation for the system backend | ms | summary

»Merkle Tree and Write Ahead Log Metrics

These metrics relate to internal operations on Merkle Trees and Write Ahead Logs (WAL).

Metric | Description | Unit | Type
vault.merkle_flushdirty | Time taken to flush any dirty pages to cold storage | ms | summary
vault.merkle_savecheckpoint | Time taken to save the checkpoint | ms | summary
vault.wal_deletewals | Time taken to delete a Write Ahead Log (WAL) | ms | summary
vault.wal_gc_deleted | Number of Write Ahead Logs (WAL) deleted during each garbage collection run | WAL | counter
vault.wal_gc_total | Total number of Write Ahead Logs (WAL) on disk | WAL | counter
vault.wal_loadWAL | Time taken to load a Write Ahead Log (WAL) | ms | summary
vault.wal_persistwals | Time taken to persist a Write Ahead Log (WAL) | ms | summary
vault.wal_flushready | Time taken to flush a ready Write Ahead Log (WAL) to storage | ms | summary

»Replication Metrics

These metrics relate to Vault Enterprise Replication. The following metrics are not available in telemetry unless replication is in an unhealthy state: replication.fetchRemoteKeys, replication.merkleDiff, and replication.merkleSync.

Metric | Description | Unit | Type
logshipper.streamWALs.missing_guard | Number of times the starting Merkle Tree index used to begin streaming WAL entries is not matched/found | missing guards | counter
logshipper.streamWALs.guard_found | Number of times the starting Merkle Tree index used to begin streaming WAL entries is matched/found | found guards | counter
replication.fetchRemoteKeys | Time taken to fetch keys from a remote cluster participating in replication prior to Merkle Tree based delta generation | ms | summary
replication.merkleDiff | Time taken to perform a Merkle Tree based delta generation between the clusters participating in replication | ms | summary
replication.merkleSync | Time taken to perform a Merkle Tree based synchronization using the last delta generated between the clusters participating in replication | ms | summary
replication.merkle.commit_index | The last committed index in the Merkle Tree | sequence number | gauge
replication.wal.last_wal | The index of the last WAL | sequence number | gauge
replication.wal.last_dr_wal | The index of the last DR WAL | sequence number | gauge
replication.wal.last_performance_wal | The index of the last Performance WAL | sequence number | gauge
replication.fsm.last_remote_wal | The index of the last remote WAL | sequence number | gauge
vault.replication.wal.gc | Time taken to complete one run of the WAL garbage collection process | ms | summary
replication.rpc.server.auth_request | Duration of time taken by auth request | ms | summary
replication.rpc.server.bootstrap_request | Duration of time taken by bootstrap request | ms | summary
replication.rpc.server.conflicting_pages_request | Duration of time taken by conflicting pages request | ms | summary
replication.rpc.server.echo | Duration of time taken by echo | ms | summary
replication.rpc.server.forwarding_request | Duration of time taken by forwarding request | ms | summary
replication.rpc.server.guard_hash_request | Duration of time taken by guard hash request | ms | summary
replication.rpc.server.persist_alias_request | Duration of time taken by persist alias request | ms | summary
replication.rpc.server.persist_persona_request | Duration of time taken by persist persona request | ms | summary
replication.rpc.server.stream_wals_request | Duration of time taken by stream wals request | ms | summary
replication.rpc.server.sub_page_hashes_request | Duration of time taken by sub page hashes request | ms | summary
replication.rpc.server.sync_counter_request | Duration of time taken by sync counter request | ms | summary
replication.rpc.server.upsert_group_request | Duration of time taken by upsert group request | ms | summary
replication.rpc.client.conflicting_pages | Duration of time taken by client conflicting pages request | ms | summary
replication.rpc.client.fetch_keys | Duration of time taken by client fetch keys request | ms | summary
replication.rpc.client.forward | Duration of time taken by client forward request | ms | summary
replication.rpc.client.guard_hash | Duration of time taken by client guard hash request | ms | summary
replication.rpc.client.persist_alias | Duration of time taken by client persist alias request | ms | summary
replication.rpc.client.register_auth | Duration of time taken by client register auth request | ms | summary
replication.rpc.client.register_lease | Duration of time taken by client register lease request | ms | summary
replication.rpc.client.stream_wals | Duration of time taken by client stream wals request | ms | summary
replication.rpc.client.sub_page_hashes | Duration of time taken by client sub page hashes request | ms | summary
replication.rpc.client.sync_counter | Duration of time taken by client sync counter request | ms | summary
replication.rpc.client.upsert_group | Duration of time taken by client upsert group request | ms | summary
replication.rpc.client.wrap_in_cubbyhole | Duration of time taken by client wrap in cubbyhole request | ms | summary
replication.rpc.dr.server.echo | Duration of time taken by DR echo request | ms | summary
replication.rpc.dr.server.fetch_keys_request | Duration of time taken by DR fetch keys request | ms | summary
replication.rpc.standby.server.echo | Duration of time taken by standby echo request | ms | summary
replication.rpc.standby.server.register_auth_request | Duration of time taken by standby register auth request | ms | summary
replication.rpc.standby.server.register_lease_request | Duration of time taken by standby register lease request | ms | summary
replication.rpc.standby.server.wrap_token_request | Duration of time taken by standby wrap token request | ms | summary

»Secrets Engines Metrics

These metrics relate to the supported secrets engines.

Metric | Description | Unit | Type
database.Initialize | Time taken to initialize a database secrets engine across all database secrets engines | ms | summary
database.<name>.Initialize | Time taken to initialize a database secrets engine for the named database secrets engine <name>, for example: database.postgresql-prod.Initialize | ms | summary
database.Initialize.error | Number of database secrets engine initialization operation errors across all database secrets engines | errors | counter
database.<name>.Initialize.error | Number of database secrets engine initialization operation errors for the named database secrets engine <name>, for example: database.postgresql-prod.Initialize.error | errors | counter
database.Close | Time taken to close a database secrets engine across all database secrets engines | ms | summary
database.<name>.Close | Time taken to close a database secrets engine for the named database secrets engine <name>, for example: database.postgresql-prod.Close | ms | summary
database.Close.error | Number of database secrets engine close operation errors across all database secrets engines | errors | counter
database.<name>.Close.error | Number of database secrets engine close operation errors for the named database secrets engine <name>, for example: database.postgresql-prod.Close.error | errors | counter
database.CreateUser | Time taken to create a user across all database secrets engines | ms | summary
database.<name>.CreateUser | Time taken to create a user for the named database secrets engine <name> | ms | summary
database.CreateUser.error | Number of user creation operation errors across all database secrets engines | errors | counter
database.<name>.CreateUser.error | Number of user creation operation errors for the named database secrets engine <name>, for example: database.postgresql-prod.CreateUser.error | errors | counter
database.RenewUser | Time taken to renew a user across all database secrets engines | ms | summary
database.<name>.RenewUser | Time taken to renew a user for the named database secrets engine <name>, for example: database.postgresql-prod.RenewUser | ms | summary
database.RenewUser.error | Number of user renewal operation errors across all database secrets engines | errors | counter
database.<name>.RenewUser.error | Number of user renewal operation errors for the named database secrets engine <name>, for example: database.postgresql-prod.RenewUser.error | errors | counter
database.RevokeUser | Time taken to revoke a user across all database secrets engines | ms | summary
database.<name>.RevokeUser | Time taken to revoke a user for the named database secrets engine <name>, for example: database.postgresql-prod.RevokeUser | ms | summary
database.RevokeUser.error | Number of user revocation operation errors across all database secrets engines | errors | counter
database.<name>.RevokeUser.error | Number of user revocation operation errors for the named database secrets engine <name>, for example: database.postgresql-prod.RevokeUser.error | errors | counter

»Storage Backend Metrics

These metrics relate to the supported storage backends.

Metric | Description | Unit | Type
vault.azure.put | Duration of a PUT operation against the Azure storage backend | ms | summary
vault.azure.get | Duration of a GET operation against the Azure storage backend | ms | summary
vault.azure.delete | Duration of a DELETE operation against the Azure storage backend | ms | summary
vault.azure.list | Duration of a LIST operation against the Azure storage backend | ms | summary
vault.cassandra.put | Duration of a PUT operation against the Cassandra storage backend | ms | summary
vault.cassandra.get | Duration of a GET operation against the Cassandra storage backend | ms | summary
vault.cassandra.delete | Duration of a DELETE operation against the Cassandra storage backend | ms | summary
vault.cassandra.list | Duration of a LIST operation against the Cassandra storage backend | ms | summary
vault.cockroachdb.put | Duration of a PUT operation against the CockroachDB storage backend | ms | summary
vault.cockroachdb.get | Duration of a GET operation against the CockroachDB storage backend | ms | summary
vault.cockroachdb.delete | Duration of a DELETE operation against the CockroachDB storage backend | ms | summary
vault.cockroachdb.list | Duration of a LIST operation against the CockroachDB storage backend | ms | summary
vault.consul.put | Duration of a PUT operation against the Consul storage backend | ms | summary
vault.consul.get | Duration of a GET operation against the Consul storage backend | ms | summary
vault.consul.delete | Duration of a DELETE operation against the Consul storage backend | ms | summary
vault.consul.list | Duration of a LIST operation against the Consul storage backend | ms | summary
vault.couchdb.put | Duration of a PUT operation against the CouchDB storage backend | ms | summary
vault.couchdb.get | Duration of a GET operation against the CouchDB storage backend | ms | summary
vault.couchdb.delete | Duration of a DELETE operation against the CouchDB storage backend | ms | summary
vault.couchdb.list | Duration of a LIST operation against the CouchDB storage backend | ms | summary
vault.dynamodb.put | Duration of a PUT operation against the DynamoDB storage backend | ms | summary
vault.dynamodb.get | Duration of a GET operation against the DynamoDB storage backend | ms | summary
vault.dynamodb.delete | Duration of a DELETE operation against the DynamoDB storage backend | ms | summary
vault.dynamodb.list | Duration of a LIST operation against the DynamoDB storage backend | ms | summary
vault.etcd.put | Duration of a PUT operation against the etcd storage backend | ms | summary
vault.etcd.get | Duration of a GET operation against the etcd storage backend | ms | summary
vault.etcd.delete | Duration of a DELETE operation against the etcd storage backend | ms | summary
vault.etcd.list | Duration of a LIST operation against the etcd storage backend | ms | summary
vault.gcs.put | Duration of a PUT operation against the Google Cloud Storage storage backend | ms | summary
vault.gcs.get | Duration of a GET operation against the Google Cloud Storage storage backend | ms | summary
vault.gcs.delete | Duration of a DELETE operation against the Google Cloud Storage storage backend | ms | summary
vault.gcs.list | Duration of a LIST operation against the Google Cloud Storage storage backend | ms | summary
vault.gcs.lock.unlock | Duration of an UNLOCK operation against the Google Cloud Storage storage backend in HA mode | ms | summary
vault.gcs.lock.lock | Duration of a LOCK operation against the Google Cloud Storage storage backend in HA mode | ms | summary
vault.gcs.lock.value | Duration of a VALUE operation against the Google Cloud Storage storage backend in HA mode | ms | summary
vault.mssql.put | Duration of a PUT operation against the MS-SQL storage backend | ms | summary
vault.mssql.get | Duration of a GET operation against the MS-SQL storage backend | ms | summary
vault.mssql.delete | Duration of a DELETE operation against the MS-SQL storage backend | ms | summary
vault.mssql.list | Duration of a LIST operation against the MS-SQL storage backend | ms | summary
vault.mysql.put | Duration of a PUT operation against the MySQL storage backend | ms | summary
vault.mysql.get | Duration of a GET operation against the MySQL storage backend | ms | summary
vault.mysql.delete | Duration of a DELETE operation against the MySQL storage backend | ms | summary
vault.mysql.list | Duration of a LIST operation against the MySQL storage backend | ms | summary
vault.postgres.put | Duration of a PUT operation against the PostgreSQL storage backend | ms | summary
vault.postgres.get | Duration of a GET operation against the PostgreSQL storage backend | ms | summary
vault.postgres.delete | Duration of a DELETE operation against the PostgreSQL storage backend | ms | summary
vault.postgres.list | Duration of a LIST operation against the PostgreSQL storage backend | ms | summary
vault.s3.put | Duration of a PUT operation against the Amazon S3 storage backend | ms | summary
vault.s3.get | Duration of a GET operation against the Amazon S3 storage backend | ms | summary
vault.s3.delete | Duration of a DELETE operation against the Amazon S3 storage backend | ms | summary
vault.s3.list | Duration of a LIST operation against the Amazon S3 storage backend | ms | summary
vault.spanner.put | Duration of a PUT operation against the Google Cloud Spanner storage backend | ms | summary
vault.spanner.get | Duration of a GET operation against the Google Cloud Spanner storage backend | ms | summary
vault.spanner.delete | Duration of a DELETE operation against the Google Cloud Spanner storage backend | ms | summary
vault.spanner.list | Duration of a LIST operation against the Google Cloud Spanner storage backend | ms | summary
vault.spanner.lock.unlock | Duration of an UNLOCK operation against the Google Cloud Spanner storage backend in HA mode | ms | summary
vault.spanner.lock.lock | Duration of a LOCK operation against the Google Cloud Spanner storage backend in HA mode | ms | summary
vault.spanner.lock.value | Duration of a VALUE operation against the Google Cloud Spanner storage backend in HA mode | ms | summary
vault.swift.put | Duration of a PUT operation against the Swift storage backend | ms | summary
vault.swift.get | Duration of a GET operation against the Swift storage backend | ms | summary
vault.swift.delete | Duration of a DELETE operation against the Swift storage backend | ms | summary
vault.swift.list | Duration of a LIST operation against the Swift storage backend | ms | summary
vault.zookeeper.put | Duration of a PUT operation against the ZooKeeper storage backend | ms | summary
vault.zookeeper.get | Duration of a GET operation against the ZooKeeper storage backend | ms | summary
vault.zookeeper.delete | Duration of a DELETE operation against the ZooKeeper storage backend | ms | summary
vault.zookeeper.list | Duration of a LIST operation against the ZooKeeper storage backend | ms | summary

»Integrated Raft Storage Health

These metrics relate to Raft-based integrated storage.

Metric | Description | Unit | Type
vault.raft.apply | Number of Raft transactions occurring over the interval, which is a general indicator of the write load on the Raft servers. | raft transactions / interval | counter
vault.raft.barrier | Number of times the node has started the barrier, i.e., the number of times it has issued a blocking call to ensure that all pending queued operations have been applied to the node's FSM. | blocks / interval | counter
vault.raft.candidate.electSelf | Time to request a vote from a peer. | ms | summary
vault.raft.commitNumLogs | Number of logs processed for application to the FSM in a single batch. | logs | gauge
vault.raft.commitTime | Time to commit a new entry to the Raft log on the leader. | ms | timer
vault.raft.compactLogs | Time to trim the logs that are no longer needed. | ms | summary
vault.raft.delete | Time to delete a file from Raft's underlying storage. | ms | summary
vault.raft.delete_prefix | Time to delete files under a prefix from Raft's underlying storage. | ms | summary
vault.raft.fsm.apply | Number of logs committed since the last interval. | commit logs / interval | summary
vault.raft.fsm.applyBatch | Time to apply a batch of logs. | ms | summary
vault.raft.fsm.applyBatchNum | Number of logs applied in a batch. | logs | summary
vault.raft.fsm.enqueue | Time to enqueue a batch of logs for the FSM to apply. | ms | timer
vault.raft.fsm.restore | Time taken by the FSM to restore its state from a snapshot. | ms | summary
vault.raft.fsm.snapshot | Time taken by the FSM to record the current state for the snapshot. | ms | summary
vault.raft.fsm.store_config | Time to store the configuration. | ms | summary
vault.raft.get | Time to retrieve a file from Raft's underlying storage. | ms | summary
vault.raft.leader.dispatchLog | Time for the leader to write log entries to disk. | ms | timer
vault.raft.leader.dispatchNumLogs | Number of logs committed to disk in a batch. | logs | gauge
vault.raft.list | Time to retrieve a list of keys from Raft's underlying storage. | ms | summary
vault.raft.put | Time to persist a key in Raft's underlying storage. | ms | summary
vault.raft.replication.appendEntries.log | Number of logs replicated to a node to bring it up to speed with the leader's logs. | logs appended / interval | counter
vault.raft.replication.appendEntries.rpc | Time taken by the append entries RPC to replicate the log entries of a leader node onto its follower node(s). | ms | timer
vault.raft.replication.heartbeat | Time taken to invoke appendEntries on a peer so that it does not time out on a periodic basis. | ms | timer
vault.raft.replication.installSnapshot | Time taken to process the installSnapshot RPC call. This metric should only be seen on nodes which are currently in the follower state. | ms | timer
vault.raft.restore | Number of times the restore operation has been performed by the node. Here, restore refers to the action of Raft consuming an external snapshot to restore its state. | operation invoked / interval | counter
vault.raft.restoreUserSnapshot | Time taken by the node to restore the FSM state from a user's snapshot. | ms | timer
vault.raft.rpc.appendEntries | Time taken to process an append entries RPC call from a node. | ms | timer
vault.raft.rpc.appendEntries.processLogs | Time taken to process the outstanding log entries of a node. | ms | timer
vault.raft.rpc.appendEntries.storeLogs | Time taken to add any outstanding logs for a node since the last appendEntries was invoked. | ms | timer
vault.raft.rpc.installSnapshot | Time taken to process the installSnapshot RPC call. This metric should only be seen on nodes which are currently in the follower state. | ms | timer
vault.raft.rpc.processHeartbeat | Time taken to process a heartbeat request. | ms | timer
vault.raft.rpc.requestVote | Time taken to complete the requestVote RPC call. | ms | summary
vault.raft.snapshot.create | Time taken to initialize the snapshot process. | ms | timer
vault.raft.snapshot.persist | Time taken to dump the current snapshot taken by the node to disk. | ms | timer
vault.raft.snapshot.takeSnapshot | Total time involved in taking the current snapshot (creating one and persisting it) by the node. | ms | timer
vault.raft.state.follower | Number of times the node has entered the follower mode. This happens when a new node joins the cluster or after the end of a leader election. | follower state entered / interval | counter
vault.raft.transition.heartbeat_timeout | Number of times the node has transitioned to the Candidate state after receiving no heartbeat messages from the last known leader. | timeouts / interval | counter
vault.raft.transition.leader_lease_timeout | Number of times a quorum of nodes could not be contacted. | contact failures | counter
vault.raft.verify_leader | Number of times the node checks whether it is still the leader. | checks / interval | counter
vault.raft-storage.delete | Time to insert a log entry to delete a path. | ms | timer
vault.raft-storage.get | Time to retrieve the value for a path from the FSM. | ms | timer
vault.raft-storage.put | Time to insert a log entry to persist a path. | ms | timer
vault.raft-storage.list | Time to list all entries under the prefix from the FSM. | ms | timer
vault.raft-storage.transaction | Time to insert operations into a single log. | ms | timer

»Integrated Raft Storage Leadership Changes

Metric | Description | Unit | Type
vault.raft.leader.lastContact | Measures the time since the leader was last able to contact the follower nodes when checking its leader lease | ms | summary
vault.raft.state.candidate | Increments whenever raft server starts an election | elections | counter
vault.raft.state.leader | Increments whenever raft server becomes a leader | leaders | counter

Why they're important: Normally, your raft cluster should have a stable leader. If there are frequent elections or leadership changes, it would likely indicate network issues between the raft nodes, or that the raft servers themselves are unable to keep up with the load.

What to look for: For a healthy cluster, you're looking for a lastContact lower than 200ms, leader > 0 and candidate == 0. Deviations from this might indicate flapping leadership.
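The thresholds above translate directly into a health check. The following is a sketch under the assumption that you already have the latest lastContact value (in ms) and the leader and candidate counts for the interval; the function name and message wording are illustrative.

```python
from typing import List


def leadership_warnings(last_contact_ms: float, leader: int, candidate: int,
                        max_last_contact_ms: float = 200.0) -> List[str]:
    """Return a warning for each deviation from the healthy-cluster guidance."""
    warnings = []
    if last_contact_ms >= max_last_contact_ms:
        warnings.append(f"lastContact is {last_contact_ms} ms, expected below {max_last_contact_ms} ms")
    if leader <= 0:
        warnings.append("leader count is not greater than 0")
    if candidate != 0:
        warnings.append(f"candidate count is {candidate}, expected 0")
    return warnings
```

An empty list means the readings match the guidance; any entries suggest possible leadership flapping and are worth correlating with network and load data.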