Intermediate Data Schema (IDS)

Harmonize Your Experimental Data

Introduction

IDS is short for "Intermediate Data Schema". It is designed by TetraScience in collaboration with instrument manufacturers, scientists, and informatics teams from Life Sciences companies to harmonize the different data sets found in the Life Sciences industry, such as instrument data, CRO assay data and software data.

The motivation behind extracting and transforming data from various sources into a predictable, consistent and vendor-agnostic schema is to allow Life Sciences companies to consume the data in their applications, build searches and aggregations, and feed the data into visualization and analysis software seamlessly.

IDS is based on JSON Schema, and IDS files are JSON files that are typically generated from raw files (such as output from an instrument or a report generated by your CRO). The IDS standardizes naming, data types (whether a field is a string, an integer or a date), data ranges (for example, a value that must be a positive number) and data hierarchy.
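To make the idea of standardized types and ranges concrete, here is a minimal sketch in Python. The schema fragment and its field names are hypothetical, not taken from a published IDS, and the `check` function handles only the `type`, `exclusiveMinimum` and `properties` keywords. It is an illustration of the concept, not a full JSON Schema validator.

```python
# Hypothetical schema fragment (illustrative only, not a real IDS).
SCHEMA = {
    "type": "object",
    "properties": {
        "cell_type": {"type": "string"},
        "dilution_factor": {"type": "number", "exclusiveMinimum": 0},
    },
}

# Subset of JSON Schema type names mapped to Python types.
PY_TYPES = {"string": str, "number": (int, float), "integer": int, "object": dict}


def check(value, schema, path="$"):
    """Return a list of violations for a small subset of JSON Schema keywords."""
    errors = []
    py_type = PY_TYPES.get(schema.get("type"))
    # bool is a subclass of int in Python, so reject it explicitly.
    if py_type is not None and (isinstance(value, bool) or not isinstance(value, py_type)):
        return [f"{path}: expected {schema['type']}"]
    minimum = schema.get("exclusiveMinimum")
    if minimum is not None and isinstance(value, (int, float)) and value <= minimum:
        errors.append(f"{path}: must be greater than {minimum}")
    if isinstance(value, dict):
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                errors.extend(check(value[key], sub_schema, f"{path}.{key}"))
    return errors


print(check({"cell_type": "CHO", "dilution_factor": 1}, SCHEMA))
print(check({"cell_type": "CHO", "dilution_factor": -1}, SCHEMA))
```

The first call reports no violations. The second reports that `dilution_factor` must be greater than 0, which is the kind of range rule the IDS is meant to enforce.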

Raw data collected from instruments is parsed by scripts that TetraScience develops according to the IDS designed for the specific instrument. Once a raw file is parsed, the output is called an IDS file. With IDS, data in various formats is standardized into one format: JSON. Together with the schema viewer, users can easily find the content stored in each IDS file, and can start searching and querying data by referencing the IDS.

IDS Design

Inside IDS, we attempt to capture as much information as possible, such as:

time, the time that the data set relates to
system(s), the equipment used to produce the result, including its software and firmware
user(s), the person who performed the experiment
sample(s), the sample used in the experiment
method(s), the experimental recipe, usually including input parameters
run(s), a particular execution of the experiment
experiment(s), information about the experiment, such as its id and name
result(s), the measurement results
related_file(s), pointer(s) to related files, e.g. the raw experimental data sets in the vendor-specific, and often proprietary, format
datacubes, multi-dimensional data such as chromatograms, images and plate readings

IDS JSON Example

Each IDS has a type, a version and a namespace. Every IDS JSON therefore has three keys at the root level: @idsType, @idsVersion and @idsNamespace.

{
  "@idsType": "cell-counter",
  "@idsVersion": "v1.0.0",
  "@idsNamespace": "common",
  "system": {
    "serial_number": "serial_number"
  },
  "run": {
    "id": "413befdd-c7e2-4edd-9e9b-06cf1cb0283f"
  },
  "time": {
    "measurement": "2015-09-24T03:47:13.0Z"
  },
  "sample": {
    "id": "unknown-10",
    "batch": {
      "id": "batch-number"
    }
  },
  "method": {
    "instrument": {
      "cell_type": "CHO",
      "dilution_factor": 1
    }
  },
  "user": {
    "name": "operator-1"
  },
  "result": {
    "cell": {
      "viability": {
        "value": 0.1,
        "unit": "Percent"
      },
      "diameter": {
        "average": {
          "live": {
            "value": 21.07,
            "unit": "Micrometer"
          }
        }
      },
      "count": {
        "total": {
          "value": 1207,
          "unit": "Cell"
        },
        "viable": {
          "value": 1,
          "unit": "Cell"
        }
      },
      "density": {
        "total": {
          "value": 102.24,
          "unit": "MillionCellsPerMilliliter"
        },
        "viable": {
          "value": 0.1,
          "unit": "MillionCellsPerMilliliter"
        }
      }
    }
  }
}
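As a rough illustration of how an application might consume such a file, the Python sketch below checks for the three root-level keys and reads one nested result. The helper names `missing_root_keys` and `get_path` are hypothetical and are not part of any TetraScience library. The embedded JSON is an abbreviated copy of the cell-counter example.

```python
import json

IDS_ROOT_KEYS = ("@idsType", "@idsVersion", "@idsNamespace")

# Abbreviated copy of the cell-counter example.
ids_text = """
{
  "@idsType": "cell-counter",
  "@idsVersion": "v1.0.0",
  "@idsNamespace": "common",
  "result": {
    "cell": {
      "density": {
        "viable": {"value": 0.1, "unit": "MillionCellsPerMilliliter"}
      }
    }
  }
}
"""


def missing_root_keys(doc):
    """Return the required root-level keys that are absent from doc."""
    return [key for key in IDS_ROOT_KEYS if key not in doc]


def get_path(doc, dotted_path):
    """Follow a dot-separated path such as 'result.cell.density.viable'."""
    node = doc
    for part in dotted_path.split("."):
        node = node[part]
    return node


doc = json.loads(ids_text)
assert missing_root_keys(doc) == []
viable = get_path(doc, "result.cell.density.viable")
print(viable["value"], viable["unit"])  # prints: 0.1 MillionCellsPerMilliliter
```

Because every value is stored next to its unit, downstream code can read both without knowing which instrument produced the file.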

Data Cubes

In IDS, Data Cubes are used to store multi-dimensional matrices.

{
  "datacubes": [
    {
      "name": "3D chromatogram",
      "description": "More information about the data cube. (Optional)",
      "another_property": "what ever you would like to put in here",
      "measures": [
        {
          "name": "intensity",
          "unit": "ArbitraryUnit",
          "value": [
            [111, 222, 333, 444, 555],
            [111, 222, 333, 444, 555],
            [111, 222, 333, 444, 555]
          ]
        }
      ],
      "dimensions": [
        {
          "name": "wavelength",
          "unit": "Nanometer",
          "scale": [180, 190, 200]
        },
        {
          "name": "time",
          "unit": "MinuteTime",
          "scale": [1, 2, 3, 4, 5]
        }
      ]
    }
  ]
}

Example	Measure(s)	Dimension(s)
Chromatogram	1. Detector Intensity	1. Wavelength 2. Retention Time
Weather	1. Temperature 2. Humidity	1. Longitude 2. Latitude
Plate Reader	1. Absorbance 2. Concentration	1. Row Position 2. Column Position
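The relationship between measures and dimensions can be sketched in Python as follows. The sketch assumes, based on the chromatogram example, that the i-th dimension indexes the i-th nesting level of each measure's `value`. The helper names are hypothetical.

```python
datacube = {
    "name": "3D chromatogram",
    "measures": [
        {
            "name": "intensity",
            "unit": "ArbitraryUnit",
            "value": [
                [111, 222, 333, 444, 555],
                [111, 222, 333, 444, 555],
                [111, 222, 333, 444, 555],
            ],
        }
    ],
    "dimensions": [
        {"name": "wavelength", "unit": "Nanometer", "scale": [180, 190, 200]},
        {"name": "time", "unit": "MinuteTime", "scale": [1, 2, 3, 4, 5]},
    ],
}


def shape_of(nested):
    """Return the shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(shape)


def check_datacube(cube):
    """Check that each measure's shape matches the dimension scale lengths."""
    expected = tuple(len(dim["scale"]) for dim in cube["dimensions"])
    for measure in cube["measures"]:
        actual = shape_of(measure["value"])
        if actual != expected:
            raise ValueError(f"{measure['name']}: shape {actual}, expected {expected}")
    return expected


def lookup(cube, measure_name, **coords):
    """Return the measure value at the given dimension coordinates."""
    measure = next(m for m in cube["measures"] if m["name"] == measure_name)
    node = measure["value"]
    for dim in cube["dimensions"]:
        node = node[dim["scale"].index(coords[dim["name"]])]
    return node


print(check_datacube(datacube))                             # (3, 5)
print(lookup(datacube, "intensity", wavelength=190, time=4))  # 444
```

Here three wavelengths and five time points give a 3 x 5 matrix, so the intensity at 190 nm and minute 4 is the fourth entry of the second row.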

Pointer to large data sets

IDS JSON uses the pointer pattern to reference arbitrarily large files.

{
  "result": {
    "images": [
      {
        "id": "1",
        "location": {
          "fileKey": "uuid/b1020 001.tif",
          "version": "version-value",
          "bucket": "datalake",
          "type": "s3file",
          "fileId": "the uuid for each file in ts data lake"
        }
      }
    ]
  }
}
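A consumer can walk an IDS document and collect such pointers before fetching the files they describe. The Python sketch below assumes that `bucket` and `fileKey` correspond to an S3 bucket name and object key when `type` is `s3file`. That mapping is an inference from the field names, and both helper functions are hypothetical.

```python
doc = {
    "result": {
        "images": [
            {
                "id": "1",
                "location": {
                    "fileKey": "uuid/b1020 001.tif",
                    "version": "version-value",
                    "bucket": "datalake",
                    "type": "s3file",
                    "fileId": "the uuid for each file in ts data lake",
                },
            }
        ]
    }
}


def find_pointers(node):
    """Yield every dict in the document that looks like a file pointer."""
    if isinstance(node, dict):
        if "fileKey" in node and "bucket" in node:
            yield node
        for value in node.values():
            yield from find_pointers(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_pointers(item)


def pointer_to_uri(location):
    """Build an s3:// URI from a pointer (assumed bucket/key mapping)."""
    if location.get("type") != "s3file":
        raise ValueError(f"unsupported pointer type: {location.get('type')!r}")
    return f"s3://{location['bucket']}/{location['fileKey']}"


for pointer in find_pointers(doc):
    print(pointer_to_uri(pointer))
```

The IDS file itself stays small because it stores only the location of the image, not the image bytes.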

How to include description and ontology in IDS?

An IDS can include descriptions and ontology terms of your choice.

{
  "type": "object",
  "properties": {
    "@idsType": {
      "type": "string",
      "const": "lc-column"
    },
    "@idsVersion": {
      "type": "string",
      "const": "v1.0.0"
    },
    "@idsNamespace": {
      "type": "string",
      "const": "common"
    },
    "column": {
      "type": "object",
      "properties": {
        "target_temperature": {
          "type": "number",
          "description": "Target temperature of liquid chromatography column during an injection, unit in degrees Celsius",
          "@type": "http://purl.company.org/column#lc-0001",
          "@prefLabel": "Column Target Temperature"
        }
      }
    }
  }
}
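Because the description and ontology annotations live in the schema itself, they can be extracted programmatically, for example to build a data dictionary. The Python sketch below walks the `properties` of a schema like the one shown and collects each annotated field. The function name `collect_annotations` is hypothetical, and the schema is an abbreviated copy of the lc-column example.

```python
schema = {
    "type": "object",
    "properties": {
        "@idsType": {"type": "string", "const": "lc-column"},
        "column": {
            "type": "object",
            "properties": {
                "target_temperature": {
                    "type": "number",
                    "description": "Target temperature of liquid chromatography "
                    "column during an injection, unit in degrees Celsius",
                    "@type": "http://purl.company.org/column#lc-0001",
                    "@prefLabel": "Column Target Temperature",
                }
            },
        },
    },
}


def collect_annotations(schema, path=""):
    """Yield (path, description, ontology type, preferred label) per annotated field."""
    for name, sub_schema in schema.get("properties", {}).items():
        child = f"{path}.{name}" if path else name
        if "description" in sub_schema or "@type" in sub_schema:
            yield (
                child,
                sub_schema.get("description"),
                sub_schema.get("@type"),
                sub_schema.get("@prefLabel"),
            )
        # Recurse into nested objects.
        yield from collect_annotations(sub_schema, child)


for field, description, ontology_type, label in collect_annotations(schema):
    print(field, "|", label, "|", ontology_type)
```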

As mentioned, IDS is designed in collaboration with instrument manufacturers and Life Sciences companies. The design is also heavily influenced by the Allotrope Foundation.

In the IDS, you can add optional fields such as @type and @id to encode Linked Data. TetraScience has an application called [ADF Converter](https://www.tetrascience.com/products/adf-converter) that converts the IDS JSON into an HDF5-based ADF file using the Leaf Node pattern.

SQL Tables

SQL tables are automatically generated based on the IDS. Each array of objects defined in the IDS has an associated SQL table exposed via JDBC and ODBC. For more information, refer to SQL Tables.
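To see which parts of a schema would qualify, the Python sketch below lists the paths of all arrays of objects in a JSON Schema. The schema fragment is hypothetical, and the sketch does not reproduce the platform's table generation or table names. It only locates the arrays of objects that the rule above refers to.

```python
# Hypothetical schema fragment (illustrative only).
schema = {
    "type": "object",
    "properties": {
        "datacubes": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "measures": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {"name": {"type": "string"}},
                        },
                    }
                },
            },
        },
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}


def array_of_object_paths(schema, path=""):
    """Return dotted paths of every array whose items are objects."""
    found = []
    items = schema.get("items")
    if schema.get("type") == "array" and isinstance(items, dict) and items.get("type") == "object":
        found.append(path)
        # Nested arrays of objects inside the item schema also qualify.
        found.extend(array_of_object_paths(items, path))
    for name, sub_schema in schema.get("properties", {}).items():
        child = f"{path}.{name}" if path else name
        found.extend(array_of_object_paths(sub_schema, child))
    return found


print(array_of_object_paths(schema))  # ['datacubes', 'datacubes.measures']
```

The `tags` array is skipped because its items are strings rather than objects.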

The TetraScience SQL interface restricts how tables, columns and partitions can be named:

lowercase
no special characters, except underscore (_)

Because IDS property names can include uppercase letters and special characters, each name is sanitized to conform to the table specification. The following transformation rules are applied:

convert to lowercase
replace all special characters with an underscore
replace all spaces with an underscore
remove repeated underscores
remove all leading underscores

Examples:

IDS Key Name	Table Column Name
`@fileId`	`fileid`
`_strange__key_@name`	`strange_key_name`
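The transformation rules can be sketched in Python as follows. This is an illustration, not the platform's actual implementation, and it assumes that a "special character" is anything other than a lowercase ASCII letter, a digit or an underscore. The function name `sanitize_name` is hypothetical.

```python
import re


def sanitize_name(key):
    """Apply the documented transformation rules to an IDS key name."""
    name = key.lower()                       # convert to lowercase
    name = re.sub(r"[^a-z0-9_]", "_", name)  # special characters and spaces -> underscore
    name = re.sub(r"_+", "_", name)          # collapse repeated underscores
    return name.lstrip("_")                  # remove leading underscores


print(sanitize_name("@fileId"))              # fileid
print(sanitize_name("_strange__key_@name"))  # strange_key_name
```

Both calls reproduce the column names in the examples table.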

View IDS

You can view the available IDS by clicking the menu button in the upper-left corner of the TetraScience platform and then clicking "Data Schemas".

Then select a schema. Click a property to expand it if it is an object or an array.

Versions

TetraScience uses Semantic Versioning for IDS versions. Every schema change produces a new IDS version, along with a new script version and a new protocol version. Your existing pipelines keep using the old versions unless you update them, which ensures data consistency for compliance. To use the new IDS, edit your pipeline protocols and select the newest version.
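Version strings such as "v1.0.0" should be compared numerically rather than as text. The Python sketch below parses the `@idsVersion` format shown earlier and picks the newest version from a list. It assumes plain `vMAJOR.MINOR.PATCH` strings without pre-release or build suffixes, and the helper names are hypothetical.

```python
def parse_ids_version(version):
    """Parse a string such as 'v1.0.0' into the tuple (1, 0, 0)."""
    text = version[1:] if version.startswith("v") else version
    major, minor, patch = text.split(".")
    return int(major), int(minor), int(patch)


def newest(versions):
    """Return the highest version, compared numerically part by part."""
    return max(versions, key=parse_ids_version)


def is_major_change(old, new):
    """Under Semantic Versioning, a major bump signals incompatible changes."""
    return parse_ids_version(new)[0] > parse_ids_version(old)[0]


available = ["v1.0.0", "v1.10.0", "v1.2.0"]
print(newest(available))                    # v1.10.0
print(is_major_change("v1.10.0", "v2.0.0"))  # True
```

A plain string comparison would rank "v1.2.0" above "v1.10.0", which is why the parts are converted to integers first.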
