Intelligent Data Subsetting and Synthetic Data Masking

Even as the use of synthetic data becomes more prevalent in quality engineering organizations, most dev and test teams are accustomed to using masked production data for testing. To meet this requirement, GenRocket offers a versatile solution for Intelligent Data Subsetting and Synthetic Data Masking. And to extend the variety and volume of data available in a given subset, GenRocket makes it easy to augment production data with controlled and conditioned synthetic data as a blended dataset.

GenRocket’s Intelligent Data Subsetting and Synthetic Data Masking capabilities offer a “best of both worlds” approach for integrating the use of synthetic data generation with production data subsetting and masking.

The chart below illustrates the versatility of the GenRocket platform.

GenRocket’s Versatile Data Subsetting and Masking Solution

Intelligent Data Subsetting

With Intelligent Data Subsetting, production data can be queried, and one or more subsets of a SQL production database can be provisioned for testing on demand and in real time. It’s an efficient way to provision test data in a fraction of the time required by traditional TDM systems.

Intelligent Data Subsetting extracts a portion of a source production database and moves it to a target dataset for automated testing in a CI/CD pipeline. A data subset can represent all data tables in the source database or a specific family of related data tables.

With GenRocket’s data subsetting features, data can be subsetted based on various parameters and the subsetted data will maintain referential integrity:

Subset based on a “where” clause defining specific values
Percentage of rows in the source database
Number of rows in the source database
Subset across different schemas, while holding referential integrity

Intelligent Data Subsetting is a powerful capability that allows testers to reduce a massive database to a manageable data subset, making it faster and easier to find the data needed for test cases. For example, a database containing millions of transactions for customers in all 50 U.S. states can be subsetted to contain transactions from just one or two states. The data subset will be a small percentage of the size of the source database while providing a consistent representation with the same referentially intact data tables.
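To make the idea concrete, here is a minimal sketch of a "where"-clause subset that keeps referential integrity. It is not GenRocket's implementation or API; it uses Python's built-in sqlite3 module, and the schema, the table names, and the functions build_source and subset_by_state are all hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical two-table schema used only for this illustration.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, state TEXT);
CREATE TABLE transactions (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount REAL
);
"""


def build_source():
    """Create an in-memory stand-in for a production database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "Ann", "CA"), (2, "Bob", "NY"), (3, "Cy", "TX"), (4, "Di", "CA")],
    )
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?)",
        [(10, 1, 25.0), (11, 2, 40.0), (12, 3, 15.5), (13, 4, 99.9), (14, 1, 5.0)],
    )
    conn.commit()
    return conn


def subset_by_state(source, states):
    """Copy customers in `states`, plus only the transactions that reference them."""
    target = sqlite3.connect(":memory:")
    target.executescript(SCHEMA)
    if not states:
        return target  # nothing selected: return an empty but valid target

    marks = ",".join("?" for _ in states)
    parents = source.execute(
        f"SELECT id, name, state FROM customers WHERE state IN ({marks})",
        list(states),
    ).fetchall()
    target.executemany("INSERT INTO customers VALUES (?, ?, ?)", parents)

    # Follow the foreign key so every copied child row has its parent in the target.
    parent_ids = [row[0] for row in parents]
    if parent_ids:
        marks = ",".join("?" for _ in parent_ids)
        children = source.execute(
            "SELECT id, customer_id, amount FROM transactions "
            f"WHERE customer_id IN ({marks})",
            parent_ids,
        ).fetchall()
        target.executemany("INSERT INTO transactions VALUES (?, ?, ?)", children)
    target.commit()
    return target
```

Calling subset_by_state(build_source(), ["CA"]) copies the two California customers and only their transactions, so no transaction in the target points at a missing customer. A real subsetting tool would walk the full foreign-key graph and stream rows in batches; this sketch handles a single parent-child relationship and small tables only.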

Data Subsetting at Speed and Scale

An internal benchmark of GenRocket’s data subsetting performance demonstrates the power and speed of the solution. In one use case, a Microsoft SQL Server database was subsetted to half its original size on a local machine, and performance was measured under controlled conditions.

With GenRocket’s Intelligent Data Subsetting, the time required to provision a subset with 8 million records was less than 2 minutes.

Synthetic Data Masking

GenRocket can be used to dynamically mask sensitive data in a database using a high-speed synthetic data replacement process. This process, called Synthetic Data Masking, can be performed at the field level for any popular SQL database and at the file level for a wide variety of data file formats. Both processes are illustrated in the diagram below.

Synthetic Data Masking allows testers to select sensitive data elements in a production database and replace them with 100% secure synthetic data. This real-time synthetic data masking process protects sensitive production data and ensures compliance with all data privacy laws.

And with GenRocket’s Synthetic Data Masking solution, sensitive data (PII, PHI) is never touched, stored, or copied. Only the metadata containing the database structure is used to identify sensitive data elements and to assign data generators to replace them with realistic, controlled, and conditioned synthetic data.

GenRocket’s Synthetic Data Masking solution is a significant improvement over traditional data masking in multiple ways:

GenRocket never accesses sensitive production data (e.g., PII or PHI), only the metadata that contains the data model used for data generation.
GenRocket replaces sensitive production data with a realistic synthetic data equivalent that is 100% compliant with all data privacy laws (CPRA, GDPR, HIPAA, etc.).
Synthetic data can be generated in patterns, permutations, negative values, and edge cases to provide greater data variety and coverage.
GenRocket is much faster than conventional masking: synthetic data can be generated to replace sensitive production data values in fractions of a second.
GenRocket’s Synthetic Data Masking requires no data storage, reservation, or refresh because a fresh copy of synthetic data is generated for each test run.
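The replacement idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not GenRocket's implementation: it uses Python's built-in sqlite3 and random modules, and the function mask_columns, the two generator functions, and the name lists are hypothetical. The sketch reads only the key column and the column names, then overwrites the sensitive columns with generated values without ever selecting the original sensitive values.

```python
import random
import sqlite3

# Hypothetical pools for fabricated names.
FIRST_NAMES = ["Avery", "Jordan", "Riley", "Morgan"]
LAST_NAMES = ["Stone", "Rivers", "Field", "Hill"]


def synthetic_name(rng):
    """Return a fabricated full name."""
    return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"


def synthetic_ssn(rng):
    """Return a fabricated value in an SSN-like NNN-NN-NNNN format."""
    return f"9{rng.randint(0, 99):02d}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}"


def mask_columns(conn, table, key_column, generators, seed=0):
    """Overwrite each column named in `generators` with synthetic values.

    Only the key column is read; the sensitive values are never selected.
    Table and column names are interpolated into SQL, so they must come from
    trusted schema metadata, never from user input.
    """
    rng = random.Random(seed)
    keys = [row[0] for row in conn.execute(f"SELECT {key_column} FROM {table}")]
    assignments = ", ".join(f"{column} = ?" for column in generators)
    for key in keys:
        values = [make(rng) for make in generators.values()]
        conn.execute(
            f"UPDATE {table} SET {assignments} WHERE {key_column} = ?",
            (*values, key),
        )
    conn.commit()
    return len(keys)
```

For example, mask_columns(conn, "people", "id", {"name": synthetic_name, "ssn": synthetic_ssn}) rewrites the name and ssn columns of every row and leaves all other columns unchanged. Passing a different seed produces a different synthetic dataset for each test run.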

GenRocket can generate synthetic data in the following database types and file formats:

Supported Databases

Oracle
MS SQL Server
IBM DB2
PostgreSQL
MySQL

Supported File Types

Any delimited file (e.g., CSV)
Fixed-width file (e.g., VSAM)
X12 EDI
JSON
XML
ORC (Hadoop)

Combined Subsetting and Masking

When used together, Intelligent Data Subsetting and Synthetic Data Masking provide a powerful solution that accelerates the time to provision masked production data for testing. Using the GenRocket platform, data subsets with synthetically masked data can be defined as pre-built Test Data Cases and shared with distributed dev and test teams via a self-service portal. This allows a fresh copy of test data to be rapidly provisioned for each test case and eliminates the need for data reservation and refresh.

Synthetic Data Augmentation

With GenRocket’s industry-leading synthetic data generation capabilities, dev and test teams can augment masked production data with synthetic data to provide a more comprehensive test dataset. This results in testing with greater coverage and also enables load and performance testing during the same test run. Simply import the data model for the required synthetic data, specify the range of values and the volume of data required, and generate the desired volume and variety of synthetic data in combination with the subsetting process.

Testing is now possible using both positive and negative data, edge case conditions, or any pattern or permutation of data values. Here are some examples:

All data combinations of account numbers, verification codes & expiration dates
Business logic that controls reward point values based on cumulative spending
Edge cases for validating credit approval based on transaction amount & credit limit
Scalable data volume to simulate a peak transaction load for performance testing
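Two of the examples above, exhaustive combinations and boundary values, can be sketched with the Python standard library. This is a generic illustration rather than GenRocket's generators; the function names all_combinations and credit_limit_edge_cases are hypothetical, and amounts are expressed in integer cents as an assumption to avoid floating-point rounding at the boundary.

```python
from itertools import product


def all_combinations(account_numbers, verification_codes, expiration_dates):
    """Return every (account, code, expiry) triple: the full Cartesian product."""
    return list(product(account_numbers, verification_codes, expiration_dates))


def credit_limit_edge_cases(credit_limit_cents):
    """Return transaction amounts (in cents) at and around the approval boundary."""
    return [
        -1,                      # negative amount: a negative test value
        0,                       # zero amount
        credit_limit_cents - 1,  # just under the limit
        credit_limit_cents,      # exactly at the limit
        credit_limit_cents + 1,  # just over the limit
    ]
```

Two account numbers, two verification codes, and two expiration dates yield eight combinations, and the count grows multiplicatively with each added value, which is why generating such data is usually more practical than finding it in production.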

A Single Platform for Any Kind of Test Data

GenRocket’s Test Data Automation solution provides a single platform for enterprise-wide test data provisioning. It enables the design and delivery of synthetic data that meets the requirements of any automated test case. At the same time, it has the ability to copy, mask and subset production data with greater speed and flexibility than traditional TDM systems. This allows teams to combine synthetic and production data in the most effective ways to meet their needs.
