
Data Masking

Data masking replaces real sensitive values with realistic fakes, protecting production data in dev and test environments without breaking application workflows.


What is Data Masking?

Data masking is a technique for protecting sensitive data by replacing real values with fictitious but structurally realistic alternatives that preserve the data's format and usability while removing its sensitivity. A customer's actual name becomes a randomly generated name. A real credit card number becomes a syntactically valid but non-functional number. A genuine email address becomes a plausible but fake address.

The result looks and behaves like real data from the perspective of applications and workflows that process it. But it contains no actual sensitive information. An attacker who obtains a masked dataset has no useful data. A developer working in a test environment can run realistic application tests without ever touching production customer records.

Why data masking exists: the non-production data problem

Production databases contain real sensitive data: real customer PII, real payment card information, real health records, real financial details. Those databases carry strict access controls, monitoring, encryption, and audit requirements.

Development, testing, analytics, and training environments need data that behaves like production data for accurate testing and model development. But giving development teams direct access to production databases creates significant risk: broader access than necessary, weaker security controls in non-production environments, and real sensitive data in environments not designed to protect it.

This is the core problem data masking solves. It produces a version of the data that preserves the structural characteristics applications depend on (correct data types, realistic value distributions, referential integrity between related tables) without preserving the actual sensitive values that would create privacy and compliance exposure.

In regulated environments, this isn't optional. GDPR's data minimisation principle requires that personal data not be used beyond the purpose for which it was collected. Justifying the use of real customer data in a development environment is a far harder argument under that principle than simply using structurally equivalent but fictitious data. HIPAA's minimum necessary standard for PHI access creates equivalent pressure. Masking enables compliance with these principles while keeping development and testing workflows functional.

The main data masking techniques

Data masking is a category of methods, not a single technique. The right approach depends on the use case, the data type, and what properties of the original data must be preserved.

Substitution

Replacing a real value with a randomly selected value from a predefined library. A real name like "Sarah Johnson" gets substituted with a randomly selected name from a large dictionary of names. The result is a realistic name that has no relationship to any real individual. Substitution works well for names, addresses, and other text fields where structural realism matters but actual values don't.
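
A minimal sketch of substitution in Python, assuming a tiny in-memory name dictionary (real tools draw from libraries with many thousands of entries):

```python
import random

# A tiny stand-in for a substitution library; production tools ship
# dictionaries with thousands of entries.
FAKE_NAMES = ["Alex Morgan", "Priya Patel", "Jordan Lee", "Maria Santos"]

def substitute_name(_real_name: str) -> str:
    """Replace a real name with a randomly selected fake one.

    The input is ignored entirely, so the output has no relationship
    to any real individual.
    """
    return random.choice(FAKE_NAMES)

print(substitute_name("Sarah Johnson"))  # e.g. "Priya Patel"
```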

Shuffling

Redistributing existing values within a column among different rows. A column of customer names gets shuffled so each name is still real but now associated with a different record. Shuffling preserves the statistical distribution of values in the column — if the dataset has a realistic distribution of common and uncommon names, it retains that distribution after shuffling. The weakness is that it preserves actual values from the dataset, which may be problematic if the masked dataset could be cross-referenced against another source to re-identify individuals.
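
A sketch of column shuffling, assuming rows are held as a list of dictionaries:

```python
import random

rows = [
    {"id": 1, "name": "Sarah Johnson", "plan": "pro"},
    {"id": 2, "name": "Tom Okafor", "plan": "free"},
    {"id": 3, "name": "Mei Chen", "plan": "pro"},
]

# Pull the column out, permute it, and reattach: the distribution of
# values is unchanged, but each value now sits on a different record.
names = [r["name"] for r in rows]
random.shuffle(names)
for row, name in zip(rows, names):
    row["name"] = name

print(rows)
```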

Pseudonymisation

Replacing identifiable values with consistent pseudonyms that can be mapped back to the original using a separately stored key. The email address sarah.johnson@example.com always becomes the same pseudonym in the masked dataset, but the pseudonym can be reversed using the key. Pseudonymisation preserves referential integrity across tables — if a customer ID appears in multiple tables, the same pseudonym appears in all of them, preserving the relationships. GDPR treats pseudonymised data as personal data because it can be reversed, but acknowledges that pseudonymisation is a useful technical measure that reduces risk.
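
One way to sketch this in Python: a keyed hash (HMAC) yields consistent pseudonyms, and a separately stored reverse map provides the authorised path back to the originals. The key and map names here are illustrative, not a specific product's design:

```python
import hmac
import hashlib

SECRET_KEY = b"stored-separately-from-the-masked-data"  # illustrative only

# The reverse map is kept apart from the masked dataset; holding it is
# what makes pseudonymisation reversible (and keeps it in GDPR scope).
reverse_map: dict[str, str] = {}

def pseudonymise(value: str) -> str:
    """Deterministic pseudonym: the same input always yields the same
    output, preserving joins and referential integrity across tables."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:12]
    pseudonym = f"user_{digest}"
    reverse_map[pseudonym] = value    # enables authorised reversal
    return pseudonym

p1 = pseudonymise("sarah.johnson@example.com")
p2 = pseudonymise("sarah.johnson@example.com")
assert p1 == p2                       # consistent across tables
print(p1, "->", reverse_map[p1])      # reversal via the stored map
```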

Data generation

Creating entirely synthetic data that has no relationship to any real record, generated algorithmically to match the statistical properties of the original dataset. Synthetic data generation is more complex than substitution but produces datasets that are guaranteed to contain no real personal information. It's increasingly used for machine learning model training where statistical realism matters more than field-by-field accuracy.
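
A deliberately simplified sketch: fit basic statistics to a real numeric column and sample synthetic values from them. It assumes a normal distribution is an acceptable fit; real generators model the joint distribution across many columns:

```python
import random
import statistics

# Original (sensitive) salaries. The synthetic column is sampled from
# a distribution fitted to them, so no generated value maps back to a
# real row.
real_salaries = [42000, 45500, 39000, 51000, 47250, 44000]

mu = statistics.mean(real_salaries)
sigma = statistics.stdev(real_salaries)

synthetic = [round(random.gauss(mu, sigma)) for _ in range(6)]
print(synthetic)   # statistically similar, individually fictitious
```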

Nulling out

Replacing sensitive values with null or empty values. Simple to implement, but often breaks applications that require non-null values and doesn't produce the realistic data that development and testing workflows need.

Character scrambling

Randomly reordering or replacing individual characters within a value. A credit card number 4532015112830366 might become 4578203152063136. Character scrambling preserves format and length but destroys meaning. It's simple and fast but less realistic than substitution or generation.
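
A sketch of the reordering variant:

```python
import random

def scramble(value: str) -> str:
    """Randomly reorder the characters of a value, preserving its
    length and format while destroying its meaning."""
    chars = list(value)
    random.shuffle(chars)
    return "".join(chars)

print(scramble("4532015112830366"))  # e.g. "5312100536218346"
```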

Number variance

Adding a random percentage adjustment to numerical values. A salary of £45,000 becomes £43,827. The distribution of the column remains statistically plausible while individual values are altered. Useful for financial data where the range and distribution of values matters for testing but exact values must not be exposed.
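
A sketch, with the maximum adjustment as an illustrative parameter:

```python
import random

def add_variance(value: float, max_pct: float = 0.05) -> float:
    """Apply a random adjustment of up to +/- max_pct to a numeric
    value, keeping the column's range plausible while hiding the
    exact figure."""
    return round(value * (1 + random.uniform(-max_pct, max_pct)), 2)

print(add_variance(45000))  # e.g. 43827.15
```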

Static masking vs dynamic masking

These two deployment models serve different use cases and have meaningfully different architectures.

Static data masking transforms a copy of production data before it's loaded into a non-production environment. The masking process runs once when the copy is created. The resulting dataset is permanently masked: there is no way to recover the original values from the masked copy. Static masking is appropriate for test and development environments where the masked copy replaces production data and will be used repeatedly.
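
A compressed sketch of a static masking pass, combining the substitution and number variance techniques above; in practice this runs as an ETL step between the production snapshot and the test database:

```python
import random

FAKE_NAMES = ["Alex Morgan", "Priya Patel", "Jordan Lee"]

def mask_copy(rows: list[dict]) -> list[dict]:
    """One-time masking pass over a copied dataset. After this runs,
    the original values are gone from the copy; there is nothing to
    reverse."""
    masked = []
    for row in rows:
        masked.append({
            **row,
            "name": random.choice(FAKE_NAMES),  # substitution
            "salary": round(row["salary"] * (1 + random.uniform(-0.05, 0.05))),  # variance
        })
    return masked

prod_copy = [{"id": 1, "name": "Sarah Johnson", "salary": 45000}]
print(mask_copy(prod_copy))  # safe to load into the test environment
```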

Dynamic data masking intercepts queries to production data in real time and masks the results before they reach the requester. The production data remains unchanged in storage. What changes is what different users see when they query it. A customer service agent with limited access might see ****-****-****-1234 where the credit card number would be. A financial analyst with full access sees the complete number. Dynamic masking is applied at the access layer through database proxies or application-level controls.
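
The access-layer decision can be sketched as below; in real deployments this logic lives in a database proxy, view, or policy engine rather than application code, and the role names are illustrative:

```python
def mask_card(card_number: str, role: str) -> str:
    """Return a view of the card number appropriate to the caller's
    role; the stored value is never modified."""
    if role == "analyst":                          # full-access role
        return card_number
    return "****-****-****-" + card_number[-4:]    # limited role

stored = "4532015112830366"
print(mask_card(stored, "support_agent"))  # ****-****-****-0366
print(mask_card(stored, "analyst"))        # 4532015112830366
```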

The choice between them depends on what's being protected and from whom. Static masking protects the entire non-production environment from exposure by ensuring production data never enters it. Dynamic masking protects production data from over-privileged access within the production environment itself. They solve different problems and often exist in the same organisation simultaneously.

Data masking vs tokenization vs encryption

These three techniques are all used for sensitive data protection but have different properties and different appropriate use cases.

Encryption transforms data into ciphertext that requires a key to decrypt. Encrypted data can be returned to its original form using the key. Encryption is appropriate for data at rest and in transit where the original values need to be recoverable. It doesn't solve the non-production data problem because decrypting data in a development environment exposes the original sensitive values.
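
To illustrate the reversibility that distinguishes encryption from masking, a sketch using the cryptography package's Fernet recipe:

```python
# Encryption is reversible by design: anyone holding the key recovers
# the original value, which is why it doesn't solve the non-production
# data problem. Requires: pip install cryptography
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # must be protected and managed
f = Fernet(key)

ciphertext = f.encrypt(b"4532015112830366")
print(ciphertext)                    # opaque bytes, safe at rest and in transit
print(f.decrypt(ciphertext))         # b'4532015112830366', original recovered
```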

Tokenization replaces sensitive values with non-sensitive tokens maintained in a separate lookup table. The original value is stored securely in a token vault, and the token replaces it everywhere else. Like pseudonymisation, tokenization is reversible by anyone with access to the vault. Unlike masking, tokenization preserves the ability to retrieve the original value. Payment card tokenization is the dominant use case: a real card number is replaced with a token in merchant systems, with only the payment processor maintaining the vault mapping tokens to actual card numbers.
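
A toy token vault makes the architecture concrete; the token format and vault layout here are illustrative, not a payment processor's actual design:

```python
import secrets

# The mapping lives in the vault, separate from the systems that hold
# tokens. Only vault access enables reversal.
vault: dict[str, str] = {}

def tokenize(card_number: str) -> str:
    token = "tok_" + secrets.token_hex(8)
    vault[token] = card_number        # original kept only in the vault
    return token

def detokenize(token: str) -> str:
    return vault[token]               # requires vault access

t = tokenize("4532015112830366")
print(t)                 # what merchant systems store
print(detokenize(t))     # what only the vault holder can do
```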

Masking replaces sensitive values with fictitious values without maintaining a lookup that enables reversal. True masking (as opposed to pseudonymisation) is irreversible. The original value cannot be recovered from the masked value. This is appropriate for non-production environments where the original values are not needed.

The practical choice: encryption for protecting live production data that needs to be recovered, tokenization for use cases requiring both protection and the ability to retrieve the original value (payment processing, specific audit scenarios), masking for non-production environments where the original values are not needed and shouldn't be present.

Why masking accuracy depends on classification

Masking can only protect data that has been identified as sensitive. A masking programme that covers the columns specified in its configuration misses every column containing sensitive data that wasn't in the configuration.

This is why classification is the prerequisite for effective masking. Before a production database copy is created for a development environment, classification should have identified every column containing PII, PHI, PCI data, or other sensitive content. The masking job then operates against the complete set of sensitive columns, not a manually curated list that may be incomplete.

Rule-based classification that misses PII in columns with non-obvious names, or fails to detect sensitive data in free-text fields, produces masking coverage that appears complete but leaves sensitive data exposed in non-production environments. Semantic classification that evaluates column content and context produces coverage that reflects what the data actually contains.
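
To make the dependency concrete, a sketch of a masking job that consumes a classifier's output; the column names and the missed free-text field are illustrative:

```python
import random

# Masking coverage is only as good as the classification feeding it:
# the job below masks exactly the columns the classifier flagged, so
# an unflagged sensitive column passes through untouched.
classified_sensitive = {"name", "email"}   # output of a classification pass
FAKE_NAMES = ["Alex Morgan", "Priya Patel", "Jordan Lee"]

def mask_row(row: dict) -> dict:
    masked = dict(row)
    for column in classified_sensitive & row.keys():
        if column == "name":
            masked[column] = random.choice(FAKE_NAMES)
        else:
            masked[column] = "masked@example.com"
    return masked

row = {"name": "Sarah Johnson",
       "email": "sarah.johnson@example.com",
       "notes": "Call Sarah at 555-0142"}   # free-text PII the classifier missed
print(mask_row(row))   # 'notes' still leaks PII: a classification gap is a masking gap
```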

Frequently asked questions

What is data masking?

Data masking is a technique for protecting sensitive data by replacing real values with fictitious but structurally realistic alternatives, preserving the data's format and usability while removing its sensitivity. It's used primarily for non-production environments — development, testing, analytics — where data needs to behave like production data without exposing real sensitive information.

What is the difference between data masking and tokenization?

Tokenization replaces sensitive values with tokens maintained in a separate vault that enables the original value to be retrieved. Masking replaces sensitive values with fictitious values without maintaining a mapping to the original. Tokenization is reversible; masking is not. Tokenization is appropriate for payment systems where the original card number must be retrievable. Masking is appropriate for non-production environments where the original value is not needed.

What is the difference between static and dynamic data masking?

Static data masking transforms a copy of production data before loading it into a non-production environment. Dynamic data masking intercepts queries to production data in real time and masks results based on the requester's access level. Static masking prevents production data from entering non-production environments. Dynamic masking controls what different users see when they query production data.

Does data masking satisfy GDPR?

Data masking reduces GDPR compliance risk for non-production use cases by preventing real personal data from appearing in development and testing environments. GDPR's data minimisation principle supports using masked data for development where real personal data isn't necessary. However, masking alone doesn't satisfy GDPR compliance: the production data being masked still requires full GDPR protections, and pseudonymisation (reversible masking) doesn't remove data from GDPR scope.

Published May 1, 2026