Breadcrumbs

Creating a custom test data generator bundle

MettleCI Workbench can fabricate authentic-looking data using built-in data generators or custom generators defined by users. You can define one or more custom generators by supplying one or more appropriately-formatted JSON files (called a ‘bundle file’) to the MettleCI Workbench, after which your new generators are immediately available in the MettleCI Workbench test data editor. Each bundle file can contain any number of generators, each of which is a member of a category that broadly describes the type of data it provides.

File Structure - Templates and Generators

The overall structure of a data fabrication JSON bundle file is defined by Templates, Categories, and Generators.

  • Templates are one or more hierarchically-organised lists of data items from which a randomly extracted value will be provided to MettleCI Workbench.

  • Categories, which are named groupings of semantically similar generators.

  • Generators, which are essentially DataStage expressions (with some enhancements, covered below) that provide fabricated data to MettleCI Workbench. These expressions can use DataStage functions, regular expressions, values defined in the templates section of the same file, or any combination of these. Generators are referenced using the {category_name}.{generator_name} format.

For example, here’s a simple bundle file containing a single category (band) which contains a single generator (beatles-member) which generates a random name of a member of The Beatles, which it gets by choosing a random value from a template (beatles-members).

JSON
{
    "templates": {
        "beatles-members": [ "John Lennon", "George Harrison", "Paul McCartney", "Ringo Starr"]    
    },
    "categories": [
        {
            "id": "band",
            "name": "Band",
            "generators": [
                {
                    "id": "beatles-member",
                    "name": "Member name",
                    "description": "A random member of a band",
                    "type": "STRING",
                    "expression": "template('beatles-members')"
                }
            ]
        }
    ]
}

Note that …

  • We’ve defined a template array called beatles-members containing four items - the band members' names. The name of the template is for internal use within the bundle file, and has no bearing on the category.generator name exposed to MettleCI Workbench.

  • We’ve defined a single category which acts as a container for a set of generators. It is used as the first level of generator selection in the MettleCI Workbench user interface, and isolates its contituent generators, within the band namespace, in this case, separating them from any other generators with the same name which might exist in other categories.

  • Our category has a couple of properties:

    • id - the reference used to prefix calls to the generators within this category.

    • name - the label used to prefix these generator names in the MettleCI Workbench user interface.

  • Within this category we have defined a single generator with the following properties:

    • id - The name of this generator, which in this case is beatles-member. Together with the category entry we now have the fully-qualified reference to this generator: band.beatles-member.

    • label - The text label used to name this generator in the MettleCI Workbench user interface.

    • description - A textual description of the generator, displayed by the mettleci fabrication list command.

    • type - Defines what data type is output by this generator and hence the column types for which this generator will be available. Valid types are NULL, BOOLEAN, STRING, INTEGER, DECIMAL, DATE, TIME, and TIMESTAMP.

    • expression - The key component of the generator definition! The expressions here uses the same expression language as the DataStage expressions you’d use in a DataStage transformer or when selecting the Utility → Custom Expression data fabrication tool (documentation) in the MettleCI Workbench test data editor.

The four differences in the capabilities of the DataStage expression language when used within the context of a custom generator expression are:

  • the row.{column_name} reference is not available, as this has no context within generator expressions.

  • a new template(template_name) function is made available, which selects a random value from the named template which must exist within the same bundle file.

  • a new params.{parameter-id} reference is made available, which is substituted with the value of the named generator parameter. This topic is covered further down this page.

  • a new switch(expression, value1, result1, [value2, result2], ..., [default]) function is made available, which has a syntax and behaviour modelled after the identically named Microsoft Excel function (search results). An example of its usage is provided further down this page.

The two places from which this generator can be invoked are:

Assuming this bundle file was saved as bands.json then the command to test this generator would be:

Bash
$>  mettleci fabrication test -path music.json -generator band.beatles-member
MettleCI Command Line (build 221)
(C) 2018-2022 Data Migrators Pty Ltd
fabrication test (v1.0-SNAPSHOT)
George Harrison
George Harrison
Paul McCartney
John Lennon
George Harrison

Template Hierarchy

Sometimes you may want to give your template more structure to help manage the taxonomy of your data. For this reason templates can be organized hierarchically, with no enforced limits on the number of hierarchical levels. Let’s extend our example from above and see how adding more bands (The Rolling Stones and Queen) might cause us to restructure our templates:

JSON
{
    "templates": {
        "bands": {
            "beatles": {
                "members": [ "John Lennon", "George Harrison", "Paul McCartney", "Ringo Starr" ]
            },
            "stones": {
                "members": [ "Mick Jagger", "Keith Richards", "Charlie Watts", "Ronnie Wood" ]
            },
            "queen": {
                "members": [ "Freddie Mercury", "Brian May", "Roger Taylor", "John Deacon" ]            
            }
        }
    },
    "categories": [
        {
            "id": "band",
            "name": "Band",
            "generators": [
                {
                    "id": "member",
                    "name": "Member name",
        			"description": "The name of a random member of The Beatles",
                    "type": "STRING",
                    "expression": "template('bands.beatles.members')"
                }
            ]
        }
    ]
}              

Note that …

  • We’ve nested our templates using the hierarchy band{band-name}members. Using that structure we could, if we wanted, introduce further template data under band{band-name}albums, or something similar.

  • Our beatles.name generator must now access the template data using a different JSON path in its expression: template('bands.beatles.members').

  • We don’t currently have a generator providing access to the information about the new bands we’ve added! We’ll add one in the next section.

Generator Parameters

Now we need to enabled users to access our template data for the new bands we've added. We could copy the approach we took for The Beatles and add a generator for each new band, like this:

JSON
    "generators": [
        {    ...    },
        {
            "id": "stones-member",
            "name": "Member name",
            "type": "STRING",
            "expression": "template('bands.stones.members')"
        },
        {
            "id": "queen-member",
            "name": "Member name",
            "type": "STRING",
            "expression": "template('bands.queen.members')"
        }
    ]

This approach isn’t very scalable, though. A better approach would be to allow users to specify which band’s members we want using a single generator with a generator parameter:

JSON
    "generators": [
        {
            "id": "member",
            "name": "Member name",
            "description": "A random member of a band",
            "type": "STRING",
        	"parameters": [
				{
					"id": "band",
					"name": "Name of band",
					"type": "STRING",
					"description": "Name of the band whose members to generate",
                  
					"nullable": true,
					"editor": {
						"type": "SelectList",
						"values": {
							"beatles": "The Beatles",
							"stones": "The Rolling Stones",
							"queen": "Queen"
						}
					}
				}
			],
            "expression": "switch(params.band, 'beatles','bands.beatles.members', 'stones','bands.stones.members', 'queen','bands.queen.members')"
        }
    ]

We’ve introduced a parameters object to the generator with the following options:

  • id - The name of the parameter when referring to it in an expression.

  • name - The text label used to name this generator parameter in the MettleCI Workbench user interface.

  • type - The type of the parameter. As for generator output, the valid types are BOOLEAN, STRING, INTEGER, DECIMAL, DATE, TIME, and TIMESTAMP.

  • description - A textual description of the generator, used by the MettleCI Workbench user interface and by the mettleci fabrication list command.

  • nullable - Whether this parameter is mandatory or optional (nullable).

MettleCI Workbench will dynamically construct an appropriate user interface to enable the user to provide your generator's defined parameters. It will select appropriate user interface elements based on the parameter types, such as string fields for STRING, date selectors for DATE, and checkboxes for BOOLEAN, for example. In this case we’d like our parameter to be presented as a drop-down list of possible values. We do this using the editor property:

  • editor (optional) - A special nested object with a set of properties that define how we want our parameters to be presented in the MettleCI Workbench user interface. Component properties are …

    • type - SelectList tells the fabrication system that we want our parameter to offer a drop-down list of potential values. This is currently the only special editor option available for user-defined fabrication bundles.

    • values - A set of key:value pairs which describe, for each potential parameter value, the value’s name (used internally) and its text label (used in the MettleCI Workbench user interface.)

You’ll notice that the data generator expression uses a new switch() function (specific to MettleCI data fabrication) which is modelled after the Microsoft Excel switch() function (search results).

While this approach works, we can derive our template name using a more concise expression:

JSON
    "generators": [
        {
            "id": "member",
            "name": "Member name",
            "description": "A random member of a band",
            "type": "STRING",
        	"parameters": [
				{ ... }
			],
            "expression": "template('bands.':params.band:'.members')"
        }
    ]

We’ve now modified the expression to use the selected parameter directly to dynamically build (using the DataStage concatenation operator) the name of the template we’re going to use. This is done using a reference of the form params.{parameter-name}.

Handling nulls

One important point to note is that our parameter is nullable! This means that a user may choose not to supply a parameter value, but in this case we will generate an invalid template reference - template('bands..members') - so we need to introduce another technique for handling this situation. We’ll do this by:

  • Updating our generator expression to explicitly handle a null parameter and call a new template designed to handle it, and

  • Creating a new, null parameter-specific template which uses template references to consolidate data from multiple templates under a single name.

Remember that generator expressions are just regular DataStage expressions. To handle the null parameter in an expression we’ll modify our expression to capture instances where our parameter is null:

JSON
# The old expression was ...
  "expression": "template('bands.':params.band:'.members')"

# A new expression could be ...
  "expression": "If IsNull(params.band) Then template('bands.all.members') Else template('bands.':params.band:'.members')"

# But a better version would be ...
  "expression": "template('bands.':NullToValue(params.band,'all'):'.members')"

You’ll see that a null parameter (handled with DataStage’s useful NullToValue function) causes us to retrieve data from a new template - bands.all.members. Let’s create that.

Composing Templates

We could configure null parameters to use a template bands.all.members created the hard way …

JSON
        "bands": {
            "all": {
                "members": [
                    "John Lennon", "George Harrison", "Paul McCartney", "Ringo Starr",
                    "Mick Jagger", "Keith Richards", "Charlie Watts", "Bill Wyman", "Ronnie Wood",
                    "Freddie Mercury", "Brian May", "Roger Taylor", "John Deacon"
                ]            
            }

… but there are many obvious problems with this approach. The smart way is to use template references to consolidate data from our existing band-specific templates under our new bands.all.members template name:

JSON
    "templates": {
        "bands": {
            "beatles": { 
                "members": [ ... ] },
            "stones": { 
                "members": [ ... ]  },
            "queen": { 
                "members": [ ... ]  },
            "all": {
                "members": [
                    "#{bands.beatles.members}", "#{bands.stones.members}", "#{bands.queen.members}"
                ]            
            }

You’ll see that we’ve effectively defined a new template with none of its own values, but it uses a number of references of the form #{template-JSON-path} to include values from other templates defined in the same file.

At this stage our music.json bundle file now looks like this:

JSON
{
    "templates": {
        "bands": {
            "beatles": {
                "members": ["John Lennon", "George Harrison", "Paul McCartney", "Ringo Starr"]
            },
            "stones": {
                "members": ["Mick Jagger", "Keith Richards", "Charlie Watts", "Bill Wyman", "Ronnie Wood"]
            },
            "queen": {
                "members": ["Freddie Mercury", "Brian May", "Roger Taylor", "John Deacon"]            
            },
            "all": {
                "members": ["#{bands.beatles.members}", "#{bands.stones.members}", "#{bands.queen.members}"]            
            }
        }
    },
    "categories": [
        {
            "id": "band",
            "name": "Band",
            "generators": [
                {
                    "id": "member",
                    "name": "member",
                    "label": "Member name",
                    "description": "A random member of a band",
                    "type": "STRING",
                	"parameters": [
        				{
        					"name": "band",
        					"type": "STRING",
        					"label": "Name of band",
        					"description": "Name of the band whose members to generate",
        					"nullable": true,
        					"editor": {
        						"type": "SelectList",
        						"values": {
        							"beatles": "The Beatles",
        							"stones": "The Rolling Stones",
        							"queen": "Queen"
        						}
        					}
        				}
        			],
                    "expression": "template('bands.':NullToValue(params.band,'all'):'.members')"
                }
            ]
        }
    ]
}

Here’s an example of this bundle being testing using the mettleci fabrication test command:

Testing the music.json bundle.
Bash
$> mettleci fabrication test -path bands.json -generator band.member -Pband=beatles
MettleCI Command Line (build 221)
(C) 2018-2022 Data Migrators Pty Ltd
fabrication test (v1.0-SNAPSHOT)
Ringo Starr
George Harrison
Paul McCartney
George Harrison
George Harrison

$> mettleci fabrication test -path bands.json -generator band.member -Pband=stones
MettleCI Command Line (build 221)
(C) 2018-2022 Data Migrators Pty Ltd
fabrication test (v1.0-SNAPSHOT)
Charlie Watts
Ronnie Wood
Bill Wyman
Ronnie Wood
Mick Jagger

$> mettleci fabrication test -path bands.json -generator band.member -Pband=queen
MettleCI Command Line (build 221)
(C) 2018-2022 Data Migrators Pty Ltd
fabrication test (v1.0-SNAPSHOT)
Roger Taylor
Roger Taylor
Roger Taylor
Roger Taylor
Freddie Mercury

$> mettleci fabrication test -path bands.json -generator band.member
MettleCI Command Line (build 221)
(C) 2018-2022 Data Migrators Pty Ltd
fabrication test (v1.0-SNAPSHOT)
Brian May
George Harrison
Charlie Watts
Freddie Mercury
Ringo Starr

Combining template values and generator expressions

Sometimes you might want to mix values from a template with values derived from an expression, such as a MettleCI data fabrication regex() call, for example. Let's look at two ways to achieve this.

We’ll start by adding some more information about our bands - some albums they’ve released, along with the year of release. We’ll also extend our bands.all template to compose those values into a single list …

JSON
        "bands": {
            "beatles": { "members": [ ... ],
                         "albums": [ "Sgt. Pepper’s Lonely Hearts Club Band (1967)", "Revolver (1966)", "Abbey Road (1969)" ]
            },
            "stones": {  "members": [ ... ],
                         "albums": [ "Exile on Main St. (1972)", "Let It Bleed (1969)", "Sticky Fingers (1971)" ]
            },
            "queen": {   "members": [ ... ],
                         "albums": [ "A Night at the Opera (1975)", "News of the World (1977)", "The Game (1980)" ]            
            },
            "all": {     "members": [ ... ],
                         "albums": [ "#{bands.beatles.albums}", "#{bands.stones.albums}", "#{bands.queen.albums}" ]            
            }
        }

We’ll also need to expose these new album data using a new generator (with a parameter like the one we’ve already seen) …

JSON
        {
            "id": "album",
            "name": "Album name",
            "description": "A random album by a band",
            "type": "STRING",
        	"parameters": [
				{
					"id": "band",
					"name": "Band name",
					"type": "STRING",
					"description": "Name of the band whose album to generate",
					"nullable": true,
					"editor": {
						"type": "SelectList",
						"values": {
							"beatles": "The Beatles",
							"stones": "The Rolling Stones",
							"queen": "Queen"
						}
					}
				}
			],
            "expression": "template('bands.':NullToValue(params.band,'all'):'.albums')"
        }

This enables us to fabricate a list of authentic album names with their year of release …

Bash
$> mettleci fabrication test -path bands.json -generator band.album
MettleCI Command Line (build 221)
(C) 2018-2022 Data Migrators Pty Ltd
fabrication test (v1.0-SNAPSHOT)
Exile on Main St. (1972)
Exile on Main St. (1972)
Abbey Road (1969)
A Night at the Opera (1975)
Sticky Fingers (1971)

That’s all well and good, but what if we wanted to randomize the year of release as well? There are two simple approaches to this.

Using templates as part of expressions

The simplest approach is to remove the year component ( (19xx)) from the album description and add this in the generator expression. We could achieve randomization using a regex() function to generate a plausible-looking random year between, say, 1965 and 1980:

JSON
"expression": "template('bands.':NullToValue(params.band,'all'):'.albums'):' (':regex('(196[5-9]|197[0-9]|1980)'):')'"

Using template placeholders with the ereplace function

Another approach would be to define placeholder values in the values defined in our template lists and use the DataStage ereplace() function (documentation) to substitute that for an expression in the generator.

Let’s start by adding placeholder values to our template. These could be anything you like, but for our example we’re going to use the value <YEAR> to denote the position in which we want our generator to insert a randomized year…

JSON
        "bands": {
            "beatles": {  "members": [ ... ],
                          "albums": [ "Sgt. Pepper’s Lonely Hearts Club Band (<YEAR>)", "Revolver (<YEAR>)", "Abbey Road (<YEAR>)" ]
            },
            "stones": {   "members": [ ... ],
                          "albums": [ "Exile on Main St. (<YEAR>)", "Let It Bleed (<YEAR>)", "Sticky Fingers (<YEAR>)" ]
            },
            "queen": {    "members": [ ... ],
                          "albums": [ "A Night at the Opera (<YEAR>)", "News of the World (<YEAR>)", "The Game (<YEAR>" ]            
            },
            "all": {      "members": [ ... ],
                          "albums": [ "#{bands.beatles.albums}", "#{bands.stones.albums}", "#{bands.queen.albums}" ]            
            }
        } 

Now we just need to replace that value with the result of our expression in our generator…

JSON
# Added newlines for clarity, but in your JSON it would all be on a single line
"expression": "ereplace(
                  template('bands.':NullToValue(params.band,'all'):'.albums'),
                  '<YEAR>',
                  regex('(196[5-9]|197[0-9]|1980)')
               )"

Using templates containing regular expressions

You can also use templates to store a list of potential expressions for use with the regex function in an expression.

Let’s say we wanted to expand our example to include a larger number of bands and generate a plausible-looking music publisher catalog number for each album. Each of the publishers involved with our bands has a different format for their catalog number which we can store as regular expressions in a template:

JSON
        "publishers": {
            ...
            "catalog_number": [
                "(APPLE|APCOR|SAPCOR|PCS|R|12R|SKBO|SMAS|SW|ST|SO)[ \\-]?[0-9]{2,5}",
                "(CDP|CLJ|ST|SM|T|E)[ ]?[0-9]{4,6}([ \\-][0-9])?",
                "(EMI|CDP|12EMI|7EMI|7243|50999)[0-9 \\-]{5,20}",
                "(0289|423|475|477)[ \\-]?[0-9]{3}[ \\-]?[0-9]{4,5}",
                "(PD|RDJ|BVCA|RCA)[ ]?[0-9]{4,6}(-[0-9])?",
                "(EK|COL|ESCA|ESCB|SRCS|BVCP)?[ ]?[0-9]{5,7}(-[0-9]{1,2})?",
                "(06025|B[0-9]{6,7})[ \\-]?[0-9]{4,7}(-[0-9]{2})?",
                "(CDV|VSCDT|7243)[ ]?[0-9]{1,7}([ \\-]?[0-9]{1,4})*",
                "(R2|8122)[ \\-]?[0-9]{4,6}(-[0-9]{1,2})?"
            ],
            ...

… and create a generator to either select one at random, or allow the user to specify a catalog number from a specific publisher:

JSON
{
    "id": "catalog_number",
    "name": "Catalog number",
    "description": "The album's catalog number",
    "type": "STRING",
    "expression": "regex(template('publishers.catalog_number'))"
}

Referential Integrity

MettleCI data fabrication does not currently support the generation of referentially integrated data. There is no ability to generate, on a particular row, a band member and band album which is guaranteed to be from the same band, for example.

Tips

  • Limit the number of hierarchical levels you define in your templates. A maximum of three levels is a good rule of thumb.

  • Your templates are used to provide indicative, sample data. For many data domains a smaller selection may suffice. We generally use a rule of thumb of keeping our templates to no more than 50 values.

  • Don’t use templates when an expression will do. e.g. don’t keep a list of plausible-looking Social Security Numbers when a regular expression that does the same thing (with greater variability) will suffice.

  • Remember to add descriptions to all your generators and generator parameters as this provides clarity to those accessing your generators in the MettleCI Workbench.

  • Please consider sharing your data generators with the DataStage community by submitting a pull request to the public GitHub repository at https://github.mettleci.io/datafab. Instructions for doing this are in the repository README file. Please ensure any data generator definitions submitted to this public repository do not contain any Intellectual Property or Personally Identifiable Information.