Partition Engine Overview

The Partition Engine generates very large volumes of test data, from hundreds of millions to billions or even trillions of rows, in a short period of time. It takes a given Scenario and partitions the load in one of the following ways:

  1. Across multiple GenRocket instances on one server.

  2. Across multiple servers, with each running multiple GenRocket instances.

Note: Click here to see examples with diagrams of the above two ways of using the Partition Engine.

Important: Generating data at this scale requires a clear understanding of what you are trying to achieve. Users of the Partition Engine should be familiar with their machine's operating system, memory, CPUs, and hard drive, including how much disk space is already in use.

Requirements for Partitioning

To generate very large amounts of test data across multiple GenRocket instances, the following requirements must be met:

  1. The volume of the test data generation must be evenly distributed across all instances.

  2. The values produced by sequential generation must be unique and increasing across all instances.
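The two requirements above can be expressed as a small check. This is only an illustrative sketch, not part of GenRocket: the function name check_partition_plan and the (start_value, row_count) tuple layout are hypothetical, and "evenly distributed" is assumed to mean that row counts differ by at most one row between instances.

```python
def check_partition_plan(plan):
    """Check a list of (start_value, row_count) tuples, one per instance.

    Hypothetical helper for illustration only; it is not a GenRocket API.
    """
    if not plan:
        return False

    counts = [row_count for _, row_count in plan]

    # Requirement 1: volume is evenly distributed
    # (assumed here to mean counts differ by at most one row).
    evenly_distributed = max(counts) - min(counts) <= 1

    # Requirement 2: sequential values are unique and increasing across
    # instances, i.e. each instance starts at or after the point where
    # the previous instance's range ends.
    no_overlap = all(
        plan[i][0] + plan[i][1] <= plan[i + 1][0]
        for i in range(len(plan) - 1)
    )

    return evenly_distributed and no_overlap


good_plan = [(1, 5), (6, 5), (11, 5)]   # ranges 1-5, 6-10, 11-15
bad_plan = [(1, 5), (5, 5), (10, 5)]    # value 5 would be produced twice
```

Here good_plan passes both checks, while bad_plan fails because the second instance starts inside the first instance's range.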

How does the Partition Engine work?

For example, suppose a user wants to generate 100 million rows of data. The GenRocket partition command can be used on a laptop or desktop to create 10 instances, each responsible for 10 million rows.

Based on a Configuration File, the Partition Engine will do the following:

  1. Load the Scenario
  2. Create ten instances of the Scenario
  3. For each instance, determine where the generator should start generating data
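As a rough illustration of step 3, the start point of each instance can be derived from the total row count and the number of instances. The sketch below is an assumption about the arithmetic involved, not the Partition Engine's actual implementation; the function name plan_partitions and the choice of 1 as the first sequence value are hypothetical.

```python
def plan_partitions(total_rows, instances, first_value=1):
    """Return one (start_value, row_count) tuple per instance.

    Illustrative only: assumes a simple sequential generator whose
    first value is `first_value`.
    """
    base, extra = divmod(total_rows, instances)
    plan = []
    start = first_value
    for i in range(instances):
        # Spread any remainder over the first `extra` instances so the
        # load stays as even as possible.
        count = base + (1 if i < extra else 0)
        plan.append((start, count))
        start += count  # the next instance begins where this one ends
    return plan


# 100 million rows split across 10 instances.
plan = plan_partitions(100_000_000, 10)
for instance_number, (start, count) in enumerate(plan, start=1):
    print(f"instance {instance_number}: start={start:,} rows={count:,}")
```

With these inputs, each instance generates 10,000,000 rows; the first starts at 1, the second at 10,000,001, and the tenth at 90,000,001.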

All ten partitions generate data at the same time and write it to files. A Bulk Load Receiver (i.e., Partition Receiver) generates the data in a bulk load format.

After all the test data has been written to files, the Bulk Load Receiver goes through the directories one at a time and loads the data directly into the database, which uses its own engine to read the files.

It will grab 10,000 files and efficiently load them into the database, starting with the first directory and data file. Then it will move to the next data file.

Below is a Single Server Component diagram example for the Partition Engine:

Note: For more information about Bulk Load Receivers, click here.

Configuration File

When a value such as an ID, Social Security Number, or Credit Card Number must be unique, it has to be unique across all instances. To guarantee this, we must know where each instance is going to start generating, so that no two instances produce the same values. This is the purpose of the configuration file.

The following example presents a payload configuration for generating 100 million users, to be evenly partitioned across 10 instances on 1 server.

Note: For more information about each Configuration File Parameter, click here.

Bulk Load Receivers

Bulk Load Receivers populate a large amount of data into data warehouses (Teradata, MongoDB, Cassandra, etc.) or unstructured databases at a faster speed than through JDBC. These Receivers generate the data to a given database’s native bulk load format.

The Bulk Load Receiver has recordsPerFile and filesPerDirectory Parameters. These Parameters define, respectively, how many rows of data should be in each file and how many files should be in each subdirectory. The default is 10,000 rows per file and 100 files per subdirectory.
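To get a feel for what these defaults mean on disk, the arithmetic can be sketched as follows. The helper names layout and locate are hypothetical, and the zero-based directory and file numbering is an assumption made for illustration; the actual naming scheme used by a Partition Receiver may differ.

```python
def layout(rows, records_per_file=10_000, files_per_directory=100):
    """Return (file_count, directory_count) needed to hold `rows` rows."""
    # Integer ceiling division: a partially filled file or directory
    # still counts as one.
    files = (rows + records_per_file - 1) // records_per_file
    directories = (files + files_per_directory - 1) // files_per_directory
    return files, directories


def locate(row_index, records_per_file=10_000, files_per_directory=100):
    """Map a zero-based row index to (directory_index, file_index_in_directory)."""
    file_number = row_index // records_per_file
    return divmod(file_number, files_per_directory)


# One instance producing 10 million rows with the default Parameter values.
files, directories = layout(10_000_000)
print(f"{files} files in {directories} subdirectories")
```

Under these assumptions, an instance that generates 10 million rows writes 1,000 files spread over 10 subdirectories, and each full subdirectory holds 1 million rows.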

When the Partition Engine starts generating data, it will create the files and subdirectories based on the Partition Receiver and its Parameter values. The image below shows the MySQLPartitionReceiver:

Note: For more information about Partition Receiver requirements for the Partition Engine, click here.

Running Multiple GenRocket Partitions

The following page can be used to learn more about running multiple GenRocket Partitions:

https://genrocket.freshdesk.com/a/solutions/articles/19000080262

Performance Considerations

  1. Operating System – A Linux/Unix operating system is recommended because the Microsoft Windows file system is not as efficient.

  2. Drive Type – An SSD is recommended because regular (spinning) drives are limited by spindle speed and seek latency.

  3. Number of CPUs – The more CPUs available, the faster data is generated, because Java can run more threads in parallel.

  4. Amount of Memory and Space – Performance starts to degrade when more than ten instances run on a single machine without enough CPUs.
