Working with Test Data Factories

Guidance for FHIR IG Creation, published by HL7 International - FHIR Management Group. This guide is not an authorized publication; it is the continuous build for version 0.1.0 built by the FHIR (HL7® FHIR® Standard) CI Build. This version is based on the current content of https://github.com/FHIR/ig-guidance/ and changes regularly. See the Directory of published versions

The IG publisher can execute a 'factory' that generates a set of resources from one or more tables of data. The factory is controlled by a JSON file that sets up the parameters for the factory.

There are two kinds of factories:

Liquid Template
Profile Generation

Note that you can have multiple factories that reuse the same source data tables.

Output Destination

There are two uses for Resource Factories:

Generating large amounts of test data
Generating resources to use as examples in the IG

There's overlap between these two, but the key question is around performance. Generating thousands of resources for use in testing doesn't take very long, but publishing thousands of resources in an IG makes for a significantly larger IG that takes many minutes to generate.

To generate test data:

Generate to a directory like /tests that isn't scanned by the IG publisher
Use the path-test parameter to add the generated resources to the test folder in the package that the IG publisher produces

To generate resources to go in the IG:

Generate the resources to a path that is registered using the path-factory parameter
The IG publisher will generate them and then load them like any other resources

Notes:

You control where the resources are generated using the Resource Factory configuration below
Executing the Resource Factories is a one-time operation; you cannot generate resources using profiles, value sets and code systems generated by another Resource Factory

Liquid Templates

A liquid template is conceptually simple: it is a script that constructs an instance of a resource from a set of data. The author provides the liquid script, and the data generation is predictable - it is based only on what the template and the data source provide. E.g. the data source has a set of columns, and the liquid template refers to the columns, laying them out in the resource. One instance is created for each row in the data source.

Liquid templates use the FHIR variant of the basic liquid syntax, which uses FHIRPath for expressions in the liquid template.

The liquid template must fully populate the resource, though if it leaves the Resource.id out, an autogenerated id will be added. The liquid template can produce either XML or JSON. In the case of JSON, the output is treated as JSON5 and converted to normal JSON after the template is run - this means that you don't have to get the commas correct in the generated JSON.

Profile Based Generation

Profile based generation works differently - there is no script laying out the content. Instead, the instances are generated based on the defined profile, including fixed values, pattern values and bindings. The data used in these generated instances comes from one of three sources (in order of preference):

Locally provided data in the form of a spreadsheet with mapping details (see below)
All the data used in published examples in the FHIR ecosystem
Randomly generated garbage data (if there's nothing from the ecosystem)
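
The order of preference above can be pictured as a simple fallback chain. The following Python fragment is only a sketch of that idea, not the IG publisher's implementation: the function name pick_value, the dictionary-backed sources and the shape of the random filler are all assumptions made for illustration.

```python
import random
from typing import Callable, Optional

# Each source is a callable that returns a value for an element path, or None.
Source = Callable[[str], Optional[str]]

def pick_value(path: str, local: Source, ecosystem: Source, rng: random.Random) -> str:
    """Return the first value available, following the order of preference."""
    for source in (local, ecosystem):
        value = source(path)
        if value is not None:
            return value
    # Last resort: random filler text (the format here is hypothetical).
    return "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(8))

# Hypothetical sources: a local spreadsheet and data harvested from examples.
local = {"Patient.name.family": "Chalmers"}.get
ecosystem = {"Patient.name.family": "Smith", "Patient.gender": "female"}.get
rng = random.Random(0)

print(pick_value("Patient.name.family", local, ecosystem, rng))  # local data wins
print(pick_value("Patient.gender", local, ecosystem, rng))       # falls back to ecosystem
```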

The details of how the locally provided data works are described below.

Defining a Resource Factory

Data Factories are defined using the parameter test-data-factories:

  <parameter>
    <code>
      <system value="http://hl7.org/fhir/tools/CodeSystem/ig-parameters"/>
      <code value="test-data-factories"/>
    </code>       
    <value value="factories/factories.json"/>
  </parameter>

Multiple test-data-factories parameters are allowed, but since each JSON file can define multiple factories, there's usually only one entry. By convention, factories are defined in the folder 'factories' but this is not required. The value points to a JSON file with this format:

Resource Factory Control File

{
  "factories-version" : 1,
  "factories" : [{
    // one entry for each factory
  }]
}

Each entry in the factory control file has the following format:

{
  "name" : "{factory-name}",
  "type" : "liquid|profile",
  "liquid" : "{template-file}",
  "profile" : "{url}",
  "data" : "{data-source}",
  "mark-profile" : true|false,
  "filename" : "{filename}",
  "format" : "json|xml",
  "bundle" : true|false,
  "tables" : {
    "name" : "{data-source}"
  },
  "filter" : "{fhirpath expression}",
  "mappings" : [{
    // mapping details - see below
  }]
}

where:

name (mandatory): the name of the factory
type (mandatory): whether to use a liquid template or the profile driven factory
data (mandatory): a path to a source data table containing the data to drive generation (see Source Data Tables below)
liquid (if liquid): a relative path to a liquid template that builds a resource
profile (if profile): the URL of a profile to use as the template for generating the instance
mark-profile (if profile): whether to make the profile explicit in the generated resource (in Resource.meta.profile). Note that the IG publisher knows if a resource is generated from a profile; you don't need to fill out the profile explicitly for that
filename (mandatory): a script that controls the name of the output file (see Output Filename below)
format (optional): the format of the generated file (doesn't have to match the format that a liquid template produces)
bundle (optional): if true, the generated resources will be wrapped into a bundle and only a single file created
tables (optional): additional named data tables that are made available for lookups (see Data Lookup below)
filter (optional): if present, a FHIRPath expression that must evaluate to true or the row is ignored when processing the source data
mappings (if profile): describes how the data table maps into the generated instances (described below)

As specified by the type, you nominate either a liquid template, or a profile and its mappings.
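
As a rough illustration of the rules above, the following Python sketch checks a single factory entry for the properties the list marks as mandatory or conditional. It is not part of the IG publisher; the function name check_factory_entry, the wording of the messages and the sample entry (including its data and filename values) are hypothetical.

```python
def check_factory_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one factory entry (empty if none)."""
    problems = []
    # Properties the documentation lists as mandatory.
    for key in ("name", "type", "data", "filename"):
        if key not in entry:
            problems.append(f"missing mandatory property '{key}'")
    kind = entry.get("type")
    if kind == "liquid" and "liquid" not in entry:
        problems.append("type 'liquid' requires a 'liquid' template path")
    elif kind == "profile" and "profile" not in entry:
        problems.append("type 'profile' requires a 'profile' URL")
    elif kind is not None and kind not in ("liquid", "profile"):
        problems.append(f"unknown type '{kind}'")
    if "format" in entry and entry["format"] not in ("json", "xml"):
        problems.append("format must be 'json' or 'xml'")
    return problems

# Hypothetical entry: the data path and filename are made up for illustration.
entry = {
    "name": "LiquidDemo",
    "type": "liquid",
    "liquid": "factories/patient.liquid",
    "data": "factories/patients.csv",
    "filename": "test/Patient-$counter$.json",
    "format": "json",
}
print(check_factory_entry(entry))  # an empty list means no problems were found
```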

Source Data Tables

Source Data can be provided in multiple different forms:

As a text file containing comma separated values (.csv)
As an Excel spreadsheet file (.xlsx). Note that you nominate the sheet name by appending ;{name} to the filename. In the absence of a sheet name, the first sheet will be used.
As an SQLite database file (.db). Note that you nominate the table or view name by appending ;{name} to the filename (required)
As a value set. Nominate the canonical URL of the ValueSet.
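
The naming rules in the list above can be sketched in Python as follows. This is an illustration only: the DataSource type and parse_data_source function are hypothetical, and treating anything without a recognised file extension as a ValueSet canonical URL is an assumption, not documented publisher behaviour.

```python
from typing import NamedTuple, Optional

class DataSource(NamedTuple):
    kind: str            # "csv", "xlsx", "sqlite" or "valueset"
    location: str        # file path or canonical URL
    name: Optional[str]  # sheet, table or view name, if one was given

def parse_data_source(spec: str) -> DataSource:
    """Split a data source string into its location and optional ;{name} suffix."""
    location, _, name = spec.partition(";")
    name = name or None
    lower = location.lower()
    if lower.endswith(".csv"):
        return DataSource("csv", location, None)
    if lower.endswith(".xlsx"):
        return DataSource("xlsx", location, name)  # no name: first sheet is used
    if lower.endswith(".db"):
        if name is None:
            raise ValueError("SQLite sources need a ';{table-or-view}' suffix")
        return DataSource("sqlite", location, name)
    # Assumption: anything else is treated as a ValueSet canonical URL.
    return DataSource("valueset", spec, None)

print(parse_data_source("factories/patients.xlsx;Sheet2"))
```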

For both .csv and .xlsx, the first row contains the names of the columns.

For all data sources, an additional column named counter is created, which is the index of the current row: a serially incrementing number starting at 1. None of the data sources can provide a column named 'counter' of their own.
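
To make the counter column concrete, here is a minimal Python sketch that reads CSV text, takes the column names from the first row, and adds a counter starting at 1. The function read_rows and the sample data are hypothetical, and raising an error when the source already has a counter column is an assumption about how the restriction might be enforced.

```python
import csv
import io

def read_rows(csv_text: str) -> list[dict]:
    """Read CSV text (first row = column names) and add a 1-based counter."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if "counter" in (reader.fieldnames or []):
        # Assumption: a clash with the built-in column is treated as an error.
        raise ValueError("source data must not define a 'counter' column")
    rows = []
    for index, row in enumerate(reader, start=1):
        row["counter"] = index
        rows.append(row)
    return rows

rows = read_rows("id,family\np1,Chalmers\np2,Windsor\n")
for row in rows:
    print(row["counter"], row["id"], row["family"])
```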

Output Filename

The output filename controls where the generated data goes. It is a path relative to the repository root folder. When bundle=true, it is a static filename for the single bundle produced by the generation. When individual resources are produced, the filename is a script that looks like this: test/Patient-{$counter$}.json, where any $xxx$ is interpreted as a reference to a named column in the primary data source.
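
The $xxx$ substitution can be sketched in Python as below. This is an approximation for illustration, not the publisher's code: the function expand_filename, the template string and the row values are hypothetical, and the sketch only replaces $xxx$ tokens, leaving every other character of the template untouched.

```python
import re

def expand_filename(template: str, row: dict) -> str:
    """Replace each $column$ token with the value of that column in the row."""
    def replace(match: re.Match) -> str:
        column = match.group(1)
        if column not in row:
            raise KeyError(f"no column named '{column}'")
        return str(row[column])
    return re.sub(r"\$([^$]+)\$", replace, template)

# Hypothetical template and row, including the built-in counter column.
row = {"counter": 3, "id": "p3"}
print(expand_filename("test/Patient-$counter$.json", row))
```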

Generation Log

A log of the process of running the Resource Factory will be generated in output/qa-factory-$log-name$.txt. One reason it's provided is to help users see the paths used in profile generation (these paths are needed for the mappings described below).

Liquid Processing Rules

The liquid template must produce resources in the specified format. If the template produces JSON, the commas do not need to be correct - the JSON is reprocessed once the liquid script is complete to fix up the commas (it must produce valid JSON5 output).

Each row of the data table is passed to the liquid template as a 'row' object whose properties are the named columns in the data table. E.g. if the data table has a column called name, then the liquid statement {{ row.name }} inserts the value of the name column in the row. Column names in the data table should not contain spaces or '-'.

The liquid template can produce a resource of any type (doesn't have to produce the same type). If bundle=true, the Liquid template should not produce a Bundle resource unless the desire is to have a Bundle of Bundles - the liquid script will run once for each row of data.

Data Lookup

The tables section of the configuration contains a list of named files. The data in the files will be available in the liquid template using [name].cell(row, col) and [name].lookup(lookupCol, value, outputCol), where:

[name] is the name given to the table in the tables section of the control file
cell(row, col) gives access to the data. Row is an integer (1 based), and col is either an integer (1 based) or a column name
lookup(lookupCol, value, outputCol) looks up a value in lookupCol, and returns the value in outputCol (or null)
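
The semantics of cell and lookup can be modelled with a small Python class. This is a sketch of the behaviour described above, not the publisher's implementation: the Table class and the sample data are hypothetical, and allowing lookup to address columns by position as well as by name is an assumption.

```python
from typing import Optional, Union

class Table:
    """A named lookup table with 1-based row and column addressing."""

    def __init__(self, columns: list[str], rows: list[list[str]]):
        self.columns = columns
        self.rows = rows

    def _col_index(self, col: Union[int, str]) -> int:
        # Columns can be addressed by 1-based position or by name.
        return col - 1 if isinstance(col, int) else self.columns.index(col)

    def cell(self, row: int, col: Union[int, str]) -> str:
        return self.rows[row - 1][self._col_index(col)]

    def lookup(self, lookup_col: Union[int, str], value: str,
               output_col: Union[int, str]) -> Optional[str]:
        # Return the output column of the first row whose lookup column matches.
        li, oi = self._col_index(lookup_col), self._col_index(output_col)
        for r in self.rows:
            if r[li] == value:
                return r[oi]
        return None  # corresponds to null in the template

# Hypothetical table of gender codes.
genders = Table(["code", "display"], [["M", "Male"], ["F", "Female"]])
print(genders.cell(2, "display"))
print(genders.lookup("code", "M", "display"))
```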

Globals

In addition, a global object is available as Globals, which has the following properties:

dateTime: the date and time in FHIR format of the instant that processing started
path: the path to the base FHIR specification (correct version path)

Profile Based Generation

In this mode, the instances are generated based on the information in the profile. The tighter the profile, the more coherent the generated instance will be.

If a primary data source is provided, an instance will be generated for each row in it. The intention of the primary data source is to support user provided information. For this reason, there is a mapping table that maps between the source data and proper FHIR data. The intention here is to allow non-technical (e.g. clinical) users to provide the sample data.

Because the data providers aren't expected or required to be technical, here's a list of things the mapping script can do to massage the data into shape ready to go in a resource:

map from a named column to a path in the resource
build a complex data type from multiple columns
extract a value from a column text
look up a value based on a number/code

Each mapping entry looks like this:

{
  "path" : "{path}",
  "fhirType" : "{type}",
  "if" : "{fhirpath expression}",
  "expression" : "{fhirpath expression}",
  "parts" : [{
    "name" : "{prop-name}",
    "expression" : "{fhirpath expression}"
  }]
}

Documentation:

path: the path in the generated instance where the data will go
- The path must match a correct path from the generation log
- The first entry that matches the set of paths will be used
- The path value can be one of the following (but just see the log)
  - The stated path in the profile
  - The underlying StructureDefinition.snapshot.element.id
  - The underlying StructureDefinition.snapshot.element.path
  - A hybrid id - either StructureDefinition.snapshot.element.id or ([extension.url]).value
fhirType: use when the type is polymorphic and not fixed in the profile. Can be either the name of a type, or a FHIRPath expression that returns the name of a type
if: if this is present, evaluate the expression, and only use the entry if the result is true
expression: an expression which evaluates to the value. See below for details
parts: a series of named expressions where the name of each part corresponds to a property name of a type
- Each part contains either an expression, or a set of parts
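
The rule that the first matching entry is used, combined with the optional if condition, can be sketched in Python as follows. This is not the publisher's code: select_mapping is a hypothetical name, the sample paths and expressions are made up, and a plain Python function stands in for the FHIRPath if expression, which this sketch does not evaluate.

```python
from typing import Optional

def select_mapping(candidate_paths: set, mappings: list, row: dict) -> Optional[dict]:
    """Return the first mapping entry that applies to an element, or None."""
    for entry in mappings:
        if entry["path"] not in candidate_paths:
            continue
        condition = entry.get("if")  # stand-in for the FHIRPath 'if' expression
        if condition is None or condition(row):
            return entry
    return None

# Hypothetical candidate paths for one element, as they might appear in the log.
paths = {"Patient.birthDate"}
mappings = [
    {"path": "Patient.birthDate", "if": lambda row: row["dob"] != "", "expression": "dob"},
    {"path": "Patient.birthDate", "expression": "'1970-01-01'"},
]

print(select_mapping(paths, mappings, {"dob": "1980-03-07"})["expression"])
print(select_mapping(paths, mappings, {"dob": ""})["expression"])
```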

There are three ways to refer to a column from the source data in the expression:

just by name, when the name of the column is a valid FHIRPath token e.g. "expression" : "patientId" where patientId is the name of the column in the source data
using the function column(name): e.g. "expression" : "column('Patient ID')" where Patient ID is the name of the column in the source data
using the function dateColumn(name, format) to wrangle with date formats e.g. "expression" : "dateColumn('Date of Birth', 'M/d/yyyy')" where Date of Birth is the name of the column in the source data, and M/d/yyyy is the format of the column. For format advice, see https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html.
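
The kind of conversion a date format enables can be illustrated in Python. This sketch is an analogy only: the publisher uses Java DateTimeFormatter patterns such as M/d/yyyy, whereas Python's strptime uses different format codes, so %m/%d/%Y is used here as the rough equivalent. The function to_fhir_date is hypothetical.

```python
from datetime import datetime

def to_fhir_date(text: str, python_format: str) -> str:
    """Parse a spreadsheet-style date and return it as a FHIR date (YYYY-MM-DD)."""
    # Surrounding whitespace is trimmed before parsing.
    return datetime.strptime(text.strip(), python_format).date().isoformat()

# "%m/%d/%Y" plays the role of the Java pattern M/d/yyyy in this sketch.
print(to_fhir_date(" 3/7/1980 ", "%m/%d/%Y"))  # prints 1980-03-07
```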

Notes:

all columns and cells have surrounding whitespace trimmed from the value
the date time formatter runs in English mode
date handling in Excel is complicated, so pay attention to the date formats in the log

Examples

This IG includes some examples. You can find the output from the examples in the package, or you can look in the package source to see how they work.

| Name | Mode | Script | Flags | Description |
|---|---|---|---|---|
| LiquidDemo | liquid | factories/patient.liquid | json Bundle | A simple liquid script showing how to look up a random value in a table |
| PatientGenerator | profile | http://hl7.org/fhir/uv/howto/StructureDefinition/test-patient-profile | json | Generate instances based on a profile in the IG, and fill out values from an Excel spreadsheet |
| EncounterGenerator | liquid | factories/encounter.liquid | json | Another liquid script showing how to do conditional content |
| BloodPressureGenerator | profile | http://hl7.org/fhir/StructureDefinition/bp | json | A more complex example. Since this is a wide open profile, a lot of what the mappings do is suppress columns |
| WeightGenerator | profile | http://hl7.org/fhir/StructureDefinition/bodyweight | json | Shows how to do conditional content depending on the content of the spreadsheet |
| WarfarinGenerator | profile | http://hl7.org/fhir/StructureDefinition/MedicationStatement | json | Shows how to filter the rows in the first place |

IG © 2019+ HL7 International - FHIR Management Group. Package hl7.fhir.uv.howto#0.1.0 based on FHIR 5.0.0. Generated 2025-03-11