Not to be confused with Vault tokens, Tokenization exchanges a sensitive value for an unrelated value called a token. The original sensitive value cannot be recovered from a token alone; tokens are irreversible. Unlike format preserving encryption, tokenization is stateful: to decode the original value, the token must be submitted to Vault, where the value is retrieved from a cryptographic mapping in storage.
On encode, Vault generates a random, signed token and stores a mapping of a
version of that token to encrypted versions of the plaintext and metadata, as
well as a fingerprint of the original plaintext. The fingerprint supports the
endpoint that lets one query whether a plaintext exists in the system.
Depending on the mapping mode, the plaintext may be decodable only with possession of the distributed token, or may also be recoverable through the export operation. See Security Considerations for more.
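The encode flow described above can be pictured with a toy, in-memory model. This is only a sketch to show the shape of the state involved (a random token, a mapping keyed by a derived form of the token, and a plaintext fingerprint for existence queries). It is not Vault's construction: the class and method names are hypothetical, the token is not signed, and the stored plaintext is not encrypted here.

```python
import hashlib
import hmac
import secrets


class ToyTokenizer:
    """In-memory toy model of stateful tokenization (not Vault's actual scheme)."""

    def __init__(self):
        # Hypothetical key used only to fingerprint plaintexts.
        self._fingerprint_key = secrets.token_bytes(32)
        # Maps a hash of the token (never the token itself) to the stored record.
        self._records = {}
        # Fingerprints of every plaintext that has been encoded.
        self._fingerprints = set()

    def _fingerprint(self, plaintext):
        digest = hmac.new(self._fingerprint_key, plaintext.encode(), hashlib.sha256)
        return digest.hexdigest()

    @staticmethod
    def _token_id(token):
        return hashlib.sha256(token.encode()).hexdigest()

    def encode(self, plaintext, metadata=None):
        # The token is random, so it carries no information about the plaintext.
        token = secrets.token_urlsafe(24)
        # Assumption: a real system would encrypt these values before storing them.
        self._records[self._token_id(token)] = {
            "plaintext": plaintext,
            "metadata": dict(metadata or {}),
        }
        self._fingerprints.add(self._fingerprint(plaintext))
        return token

    def decode(self, token):
        # Decoding is a storage lookup, not a computation on the token alone.
        record = self._records.get(self._token_id(token))
        if record is None:
            raise KeyError("unknown token")
        return record["plaintext"]

    def metadata(self, token):
        record = self._records.get(self._token_id(token))
        if record is None:
            raise KeyError("unknown token")
        return dict(record["metadata"])

    def is_tokenized(self, plaintext):
        # Answers "has this plaintext been encoded?" without needing a token.
        return self._fingerprint(plaintext) in self._fingerprints
```

Encoding the same plaintext twice in this sketch yields two different tokens, each of which decodes to the same value, while `is_tokenized` answers from the fingerprint alone.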
»Builtin (Internal) Store
As tokenization is stateful, the encode operation necessarily writes values to storage. By default, that storage is the Vault backend store itself. This differs from some secret engines in that the encode and decode operations require an access of storage per operation. Other engines use storage for configuration but can process operations largely without accessing any storage.
Because encode writes to storage, it must be performed on primary nodes. With internal storage, the scalability of the encode operation is therefore limited by the performance of the primary and its storage subsystem. All other operations can be performed on secondaries.
Finally, due to replication, writes to the primary may take some time to reach secondaries, so other read operations like decode or metadata may not succeed on the secondaries until this happens. In other words, tokenization is eventually consistent.
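One way a client might cope with this eventual consistency is to retry a read that fails shortly after an encode. The helper below is a minimal sketch, not part of any Vault client library: `call_with_retry` is a hypothetical name, and `LookupError` stands in for whatever "token not found" error the client in use actually raises.

```python
import time


def call_with_retry(operation, attempts=5, initial_delay=0.1, backoff=2.0,
                    sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are used up.

    LookupError is a stand-in for a "not found yet" failure; any other
    exception propagates immediately.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except LookupError:
            if attempt == attempts:
                # Replication did not catch up in time; surface the error.
                raise
            sleep(delay)
            # Wait longer before each subsequent attempt.
            delay *= backoff
```

A caller would wrap the decode or metadata request in a function and pass it as `operation`. The `sleep` parameter exists only so the waiting can be replaced in tests.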
With external storage, all nodes (except DRs) can participate in all operations, but one must take care to monitor and scale the external storage for the level of traffic experienced. The storage schema is simple, however, and well-known approaches should be effective.
The goal of Tokenization is to let end users' devices store the token rather than their sensitive values (such as credit card numbers) and still participate in transactions where the token is a stand-in for the sensitive value. For this reason the token Vault generates is completely unrelated to the sensitive value (i.e. irreversible).
Furthermore, the Tokenization transform is designed to resist a number of attacks on the values produced during encode. In particular, it is designed so that attackers cannot recover plaintext even if they steal the tokenization values from Vault itself. In the default mapping mode, even stealing the underlying transform key does not allow them to recover the plaintext without also possessing the encoded token. An attacker would need access to every value in the construct.
In the exportable mapping mode, however, the plaintext values are encrypted
in a way that can be decrypted within Vault. If the attacker possesses the
transform key and the tokenization mapping values, the plaintext can be
recovered. This mode is available for the case where operators prioritize the
ability to export all of the plaintext values in an emergency, via the export
operation.
Since tokenization isn't format preserving and requires storage, one can associate arbitrary metadata with a token. Metadata is considered less sensitive than the original plaintext value. As it has its own retrieval endpoint, operators can configure policies that allow access to the metadata of a token but not its decoded value, enabling workflows that operate on the metadata alone.
»TTLs and Tidying
By default, tokens are long lived, and the storage for them will be maintained indefinitely. Where the underlying value has a natural time-to-live, it is strongly encouraged that the tokens be generated with a TTL. For example, as credit cards have an expiration date, it is recommended that tokenizing a credit card primary account number (PAN) be done with a TTL that corresponds to the time after which the PAN is invalid.
This allows such values to be tidied and removed from storage once expired. Tokens themselves encode the expiration time, so decode and other operations can immediately reject the operation when presented with an expired token.
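As an illustration of the credit card case, the sketch below derives a TTL from a card's expiration date. It assumes a card remains valid through the last day of its expiration month and measures time in UTC; the function name is hypothetical, and how the resulting value is supplied to the encode request depends on the client in use.

```python
from datetime import datetime, timezone


def ttl_seconds_for_card(exp_year, exp_month, now=None):
    """Seconds from `now` until the end of the card's expiration month (UTC).

    Returns 0 if the card has already expired.
    """
    now = now or datetime.now(timezone.utc)
    # Assumption: the card is valid through the last day of its expiration
    # month, so the token should expire at the start of the following month.
    if exp_month == 12:
        expires_at = datetime(exp_year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        expires_at = datetime(exp_year, exp_month + 1, 1, tzinfo=timezone.utc)
    remaining = int((expires_at - now).total_seconds())
    return max(remaining, 0)
```

For a card expiring in January 2030 evaluated at midnight UTC on 1 January 2030, this gives 31 days, or 2,678,400 seconds.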
Currently only PostgreSQL is supported as an external storage backend for tokenization. The Schema Endpoint may be used to initialize and upgrade the necessary database tables. Vault uses a schema versioning table to determine if it needs to create or modify the tables when using that endpoint. If you make changes to those tables yourself, the automatic schema management may become out of sync and may fail in the future.
Refer to Tokenize Data with Transform Secrets Engine for a step-by-step tutorial.