Synthetic Data Generation
What is Synthetic Data?
Synthetic data is artificial data that can be created manually or generated automatically for a variety of use cases. It can be used for all forms of functional and non-functional testing, populating new data environments, or training and validating machine learning algorithms for AI applications.
Unlike test data sourced from a production environment, synthetic data contains no Personally Identifiable Information (PII), which makes it private and secure. As a result, it has become extremely popular in quality assurance organizations for companies in regulated industries like healthcare and financial services.
Manually Created Synthetic Data
Synthetic data is often manually created in the form of spreadsheets and used to replace or augment production data with combinations and variations not present in the production dataset. Augmenting production data with synthetic data increases test coverage and can prevent costly software defects from escaping to the production environment.
67% of testers use spreadsheets to generate their test data manually. This creates a bottleneck, with testers spending 30% to 60% of their time provisioning test data. Manual data provisioning is labor-intensive, limits the volume of data that can be produced, and doesn’t represent the multi-dimensional data relationships found in complex data environments.
What is Synthetic Data Generation?
Methods of Synthetic Data Generation
Synthetic data generation is much faster than manual data creation and can produce higher data volumes for load and performance testing. It’s an essential technology for reducing test cycle time and implementing shift-left testing strategies. However, not all synthetic data generation methods and technologies are the same. It’s important to understand the differences and their impact on test data quality.
Synthetic data generation systems operate in two fundamentally different ways:
- A synthetic data replica is produced by scanning and profiling a production data source
- Synthetic data is designed and generated dynamically to meet test case requirements
The first method is often associated with the use of deep learning to model and profile a production database and produce a statistically representative synthetic data replica. The result is a secure and private synthetic version of production data without its sensitive information, which can then safely be used for data analytics and business intelligence use cases.
However, statistically representative synthetic data is not well suited to software testing and quality assurance. The reason is simple: a synthetic replica has the same limitations as the production database from which it was derived. If data patterns, permutations, and variations are missing from the production database, they will also be missing from the synthetic replica.
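The limitation described above can be illustrated with a minimal Python sketch. It is a deliberate simplification: plain resampling from observed values stands in for a learned statistical model, and the column values, the `supported_statuses` set, and the `build_replica` helper are all hypothetical names chosen for illustration.

```python
import random

# Hypothetical production column: only two of the four statuses the
# application supports have ever been recorded.
production_statuses = ["ACTIVE"] * 90 + ["CLOSED"] * 10
supported_statuses = {"ACTIVE", "CLOSED", "SUSPENDED", "PENDING"}


def build_replica(values, size, seed=0):
    """Resample from observed values (a stand-in for a learned model)."""
    rng = random.Random(seed)
    return [rng.choice(values) for _ in range(size)]


replica = build_replica(production_statuses, size=1000)

# Statuses the application supports but the replica can never contain.
missing = supported_statuses - set(replica)
print(sorted(missing))  # ['PENDING', 'SUSPENDED']
```

However large the replica is made, the two statuses absent from the source never appear in it, so test cases that depend on them cannot be covered.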
Using a synthetic replica of production data for testing is a modern variation of the traditional Test Data Management (TDM) paradigm. When using a conventional TDM system, the QA team must copy, subset, mask, reserve, and refresh production data before it can be used for automated testing. In addition to the protracted provisioning time imposed by TDM systems, missing data variations reduce the overall test data quality and ultimately limit test coverage.
Using a synthetic data replica for software testing is just a faster and more automated version of the same TDM approach. The provisioning steps are streamlined, but the result is the same – missing data variations reduce test data quality and coverage.
Like its production data source, a synthetic data replica limits the ability to meet critical test case criteria:
- Injecting test data for both positive and negative testing
- Testing specific boundary conditions and edge cases
- Testing all data value combinations and permutations
- Testing new applications that have no data when introduced
- Testing complex workflows with dynamic data (see Data in Motion below)
Full test coverage requires both positive AND negative testing to fully validate the software under test. In addition, ALL combinations and permutations of data inputs must be tested to avoid costly defects in production code.
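As a rough illustration of designing data for boundary conditions and negative testing, the following Python sketch derives test values at and around the edges of an integer range. The `boundary_values` helper and the age rule are hypothetical and are not taken from any GenRocket feature.

```python
def boundary_values(minimum, maximum):
    """Return values at and just outside both ends of an inclusive range.

    Assumes the range is at least two values wide, so that the values
    just inside each boundary are themselves valid.
    """
    return [
        {"value": minimum - 1, "valid": False},  # negative case below range
        {"value": minimum, "valid": True},       # lower boundary
        {"value": minimum + 1, "valid": True},
        {"value": maximum - 1, "valid": True},
        {"value": maximum, "valid": True},       # upper boundary
        {"value": maximum + 1, "valid": False},  # negative case above range
    ]


def is_valid_age(age):
    """Hypothetical rule under test: ages 18 through 65 are accepted."""
    return 18 <= age <= 65


cases = boundary_values(18, 65)
for case in cases:
    # Each designed value carries the outcome the test expects.
    assert is_valid_age(case["value"]) == case["valid"]
```

Pairing every value with its expected outcome is what lets the same dataset drive both positive and negative tests.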
For a real-world example, read How a Single Bug Can Trigger a Massive Outage.
Real-time Synthetic Data Generation: By-Design and On-Demand
GenRocket uses a very different approach. Synthetic data is designed by software test engineers and developers based on their individual test case criteria using a self-service platform. Then synthetic data is generated on-demand and in real-time for each automated test run. This allows testers to have exactly the data they need at the moment they need it.
GenRocket’s Test Data Automation solution is based on four guiding principles, all intended to maximize the performance of its synthetic data platform:
- Unparalleled control over data quality and output data formatting
- Streamlined self-service using the fastest synthetic data architecture
- Centralized management of projects with distributed data generation
- The lowest TCO and the highest ROI of any enterprise test data platform
GenRocket uses a base template called a Project that directly relates to an application or database under test. Once a Project is created, it can be used to design synthetic data for multiple use cases based on the Agile methodology of grouping Scenarios into Stories and Stories into Epics. Self-service modules guide the tester through the design of Test Data Cases for any functional or non-functional test, to populate a new database, or to train and test a machine learning algorithm.
With GenRocket, Agile teams design the synthetic data needed to maximize coverage (or select a pre-designed Test Data Case) as they develop test cases during each sprint. The Test Data Design is then integrated with the test application and scheduled for automated execution in the CI/CD pipeline. GenRocket truly enables Quality at Speed.
How is Synthetic Data Used for Software Testing?
Data in Motion
Today’s applications are interconnected, and data flows across systems via APIs and data feeds that are constantly in motion. Transactions are created dynamically and continuously. Synthetic data generation must therefore do more than simulate static data stored in a replicated database. Test data must be inserted dynamically during a workflow, in response to user and system interactions, with enough variety to validate outcomes and test error handling. GenRocket’s synthetic data generation platform generates data in real time during test execution and uses intelligent APIs (e.g., REST API, JDBC, SOAP) to retrieve and insert data from within the testing application.
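The idea of supplying data during a workflow, rather than from a static store, can be sketched in plain Python with a generator that produces one record each time the test asks for it. This is only a conceptual sketch and does not use or represent GenRocket’s APIs; `transaction_stream` and `process` are hypothetical names.

```python
import itertools
import random


def transaction_stream(seed=0, invalid_every=5):
    """Yield an endless stream of synthetic transactions, one per request.

    Every `invalid_every`-th record carries a negative amount so that the
    workflow's error handling is exercised as well.
    """
    rng = random.Random(seed)
    for txn_id in itertools.count(1):
        amount = round(rng.uniform(1, 500), 2)
        if txn_id % invalid_every == 0:
            amount = -amount
        yield {"id": txn_id, "amount": amount}


def process(txn):
    """Hypothetical workflow step under test: reject non-positive amounts."""
    return "REJECTED" if txn["amount"] <= 0 else "ACCEPTED"


stream = transaction_stream()
# Each call to next() generates a new record at the moment it is needed.
results = [process(next(stream)) for _ in range(10)]
```

Because nothing is generated until the workflow requests it, the test never depends on a pre-loaded dataset being in a particular state.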
GenRocket’s self-service synthetic data generation platform provisions data by-design and on-demand to maximize coverage and accelerate cycle time. Here’s one customer’s experience with GenRocket’s self-service synthetic data generation:
“GenRocket’s synthetic test data solution helped us increase our regression testing coverage from 30% to 80% and provided a wider range of code coverage paths for our systems under test. At the same time, we reduced our testing cycle time from 16 days down to 2 hours.”
To learn more, read How Synthetic Data Delivers Quality at Speed in Financial Services.
Synthetic Data Streamlines the Test Data Lifecycle
Many organizations are now mandating the broad use of synthetic data for testing to ensure data privacy. By deploying Test Data Automation to generate real-time synthetic test data, many of the traditional test data management tasks described above are no longer necessary. Synthetically generated test data is a fundamentally different paradigm from sourcing data from a production environment. Here are some key differences:
- Synthetic data is 100% safe and secure, removing the risk of exposing any PII
- Data profiling and masking are not needed for 100% secure synthetic datasets
- Data volume and variety are pre-defined to provide the exact test data needed
- Data migration is not required because right-sized test data sets are available to all testers
- Data reservation is not needed as unique data can be generated for every test case
- Data refresh is not needed because a fresh copy of data is generated for each test case
With GenRocket, the test data lifecycle is transformed into a streamlined methodology that is both simple and elegant. It’s a four-stage process:
- MODEL the data relationships for your target environment (there are many ways to do this)
- DESIGN the volume and variety of test data needed to meet all test case criteria
- DEPLOY the synthetic data you have designed into your automated test environment
- MANAGE, adapt and share your test data projects for maximum reusability
For a closer look at GenRocket’s streamlined process for self-service provisioning, read the following article: A Streamlined Process for Self-Service Test Data Provisioning. It contains links to knowledgebase articles for a deeper and more technical understanding of how GenRocket enables each stage of the synthetic test data lifecycle.
How Does Data Quality Impact Synthetic Data?
The Critical Importance of Data Quality
The most important take-away from this introduction to the use of synthetic data generation for software testing is the issue of data quality. We believe there are 6 dimensions of quality that are essential for enterprise-class synthetic data generation. They are illustrated by the following diagram.
VARIETY: The ability to generate any pattern of data in a controlled fashion. If a tester wants a synthetic dataset in which every fourth row contains negative data (e.g., null value, invalid credit card number, illegal password character, incorrect computation, etc.), it should be quick and easy to design and generate the data on-demand.
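The "every fourth row" pattern can be sketched in a few lines of Python. This is an illustrative sketch rather than GenRocket’s implementation; the `generate_rows` function, the `NEGATIVE_EMAILS` list, and the email field are assumptions made for the example.

```python
import random

# Hypothetical negative values for an email field: null, empty, malformed.
NEGATIVE_EMAILS = [None, "", "missing-at-sign.example.com"]


def generate_rows(count, negative_every=4, seed=0):
    """Generate rows in which every `negative_every`-th row is negative data."""
    rng = random.Random(seed)
    rows = []
    for row_number in range(1, count + 1):
        is_negative = row_number % negative_every == 0
        if is_negative:
            email = rng.choice(NEGATIVE_EMAILS)
        else:
            email = f"user{row_number}@example.com"
        rows.append(
            {"row": row_number, "email": email, "expect_valid": not is_negative}
        )
    return rows


rows = generate_rows(12)
negative_rows = [r["row"] for r in rows if not r["expect_valid"]]
print(negative_rows)  # [4, 8, 12]
```

Fixing the seed keeps the dataset reproducible from run to run, which makes a failing test easier to replay.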
ACCURACY: The ability to ensure data values follow business rules. If a claims processing application is being tested for the accurate use of diagnostic procedure codes, they must be accurately generated by the system, or queried from an external source and blended with a synthetic dataset in the appropriate output format.
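The same principle applies to any rule-constrained value. As a stand-in for a domain-specific rule, the Python sketch below generates payment card numbers that satisfy the Luhn checksum, so the generated values pass the same validation a real application would apply. The function names and the leading digit are chosen for illustration only.

```python
import random


def luhn_check_digit(partial):
    """Compute the Luhn check digit for a digit string that lacks one."""
    total = 0
    # Walk right to left, doubling every second digit starting with the
    # rightmost digit of the partial number.
    for index, char in enumerate(reversed(partial)):
        digit = int(char)
        if index % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def is_luhn_valid(number):
    """Check that the last digit matches the Luhn check digit of the rest."""
    return luhn_check_digit(number[:-1]) == int(number[-1])


def synthetic_card_number(rng, prefix="4", length=16):
    """Build a random digit string of `length` that passes the Luhn check."""
    body_length = length - len(prefix) - 1
    body = prefix + "".join(str(rng.randint(0, 9)) for _ in range(body_length))
    return body + str(luhn_check_digit(body))


card = synthetic_card_number(random.Random(0))
assert is_luhn_valid(card)
```

Changing the final digit of a generated number to any other digit yields a matching negative test value, because the checksum then fails.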
CONSISTENCY: GenRocket uses intelligent automation to ensure all data generators and their configuration options are defined consistently across all synthetic data designs AND synthetic data is generated in a consistent manner across multiple systems under test for each test run.
REFERENTIAL INTEGRITY: GenRocket holds the only US patent for maintaining referential integrity across all data relationships no matter how many data tables are contained in a database modeled by the system or how complex their associations may be.
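To show what referential integrity means for generated data, here is a minimal two-table Python sketch in which every child row is created from an existing parent row. It illustrates the general concept only and makes no claim about how GenRocket’s patented approach works; the table and function names are hypothetical.

```python
import random


def generate_customers(count):
    """Generate parent rows with sequential primary keys."""
    return [
        {"customer_id": i, "name": f"Customer {i}"} for i in range(1, count + 1)
    ]


def generate_orders(customers, orders_per_customer, seed=0):
    """Generate child rows whose foreign key always points at a real parent."""
    rng = random.Random(seed)
    orders = []
    order_id = 1
    for customer in customers:
        for _ in range(orders_per_customer):
            orders.append(
                {
                    "order_id": order_id,
                    "customer_id": customer["customer_id"],  # foreign key
                    "total": round(rng.uniform(5, 250), 2),
                }
            )
            order_id += 1
    return orders


customers = generate_customers(3)
orders = generate_orders(customers, orders_per_customer=2)

# An orphan is an order whose customer_id matches no generated customer.
known_ids = {c["customer_id"] for c in customers}
orphans = [o for o in orders if o["customer_id"] not in known_ids]
```

Generating children from the parent rows, rather than picking foreign keys independently, is what guarantees that no orphans can occur.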
COMPLETENESS: To ensure full coverage, data must be complete, with the ability to generate all combinations and permutations of potential data inputs in a controlled fashion. GenRocket’s PermutationGen data generator is designed expressly for this purpose.
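Exhaustive combination coverage can be sketched with Python’s standard `itertools.product`. This is a generic illustration, not a description of how PermutationGen works; the input domains and the `all_combinations` helper are hypothetical.

```python
import itertools

# Hypothetical input domains for an account-opening test.
input_domains = {
    "account_type": ["checking", "savings"],
    "currency": ["USD", "EUR", "GBP"],
    "overdraft_enabled": [True, False],
}


def all_combinations(domains):
    """Yield one dict per combination of values across all input domains."""
    names = list(domains)
    for values in itertools.product(*(domains[name] for name in names)):
        yield dict(zip(names, values))


combinations = list(all_combinations(input_domains))
print(len(combinations))  # 12, i.e. 2 * 3 * 2
```

The number of rows is the product of the domain sizes, so it grows quickly as inputs are added; that growth is why the combinations need to be generated in a controlled fashion rather than written by hand.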
VOLUME: The ability to generate the volume of synthetic data needed to stress test the code or simulate massive data volumes for IoT and Big Data applications. Through its partition engine and multi-threaded processes, GenRocket can generate millions, or even billions, of rows of data in a matter of minutes.
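The general idea of partitioned generation can be sketched in Python by splitting the requested rows into non-overlapping ranges that workers fill independently. This sketch says nothing about the internals of GenRocket’s partition engine; all names are hypothetical, and a thread pool is used only to keep the example simple (CPU-bound generation in Python would normally use separate processes or machines).

```python
from concurrent.futures import ThreadPoolExecutor


def partition_ranges(total_rows, partitions):
    """Split row numbers 1..total_rows into contiguous, non-overlapping ranges."""
    base, extra = divmod(total_rows, partitions)
    ranges = []
    start = 1
    for index in range(partitions):
        # The first `extra` partitions take one additional row each.
        size = base + (1 if index < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def generate_partition(row_numbers):
    """Generate the rows for one partition; ids never collide across partitions."""
    return [{"id": n, "reading": (n * 37) % 100} for n in row_numbers]


def generate_all(total_rows, partitions=4):
    """Generate every partition concurrently and return the rows in id order."""
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        chunks = pool.map(generate_partition, partition_ranges(total_rows, partitions))
        return [row for chunk in chunks for row in chunk]


volume_rows = generate_all(10_000)
```

Because each partition owns a distinct range of ids, workers need no coordination while generating, which is what allows this style of generation to scale out.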
When these 6 dimensions of data quality are combined in a self-service synthetic data generation platform, any QA professional can maximize test coverage for any level of testing.
Learn more about How to Use Synthetic Data to Maximize Test Coverage.
If you would like to see the GenRocket Test Data Automation platform in action, request a demo with one of our synthetic data experts. We’ll be happy to answer your questions and explore ways that your organization can realize the full value of synthetic data generation.