Intermediate Data Schema (IDS)

Harmonize Your Experimental Data

Introduction

IDS is short for "Intermediate Data Schema". It was designed by TetraScience in collaboration with instrument manufacturers, scientists, and informatics teams from Life Sciences companies to harmonize different data sets in the Life Sciences industry, such as instrument data, CRO assay data, and software data.

The motivation is to extract and transform data from various sources into a predictable, consistent, and vendor-agnostic schema, so that Life Sciences companies can consume the data in their applications, build searches and aggregations, and feed the data into visualization and analysis software seamlessly.

IDS is based on JSON Schema, and IDS files are JSON files that are typically generated from raw files (such as output from an instrument or a report generated by your CRO). The IDS standardizes naming, data type (whether the field is a string, an integer, or a date), data range (for example, a value that must be a positive number), and data hierarchy.

All raw data collected from instruments is parsed by scripts developed by TetraScience, following the IDS designed for the specific instrument. Once a raw file is parsed, the output is called an IDS file. With IDS, data in various formats is standardized into one format: JSON. Together with the schema viewer, users can easily find the content stored in each IDS file, and can start searching and querying data by referencing the IDS.

IDS Design

Inside IDS, we attempt to capture as much information as possible, such as

  • time, the time that the data set is related to
  • system(s), the equipment used to produce the result, including its software and firmware
  • user(s), the person who performed the experiment
  • sample(s), the sample used in the experiment
  • method(s), the experimental recipe and usually input parameters
  • run(s), a particular execution of the experiment
  • experiment(s), information about the experiment, such as id and name
  • result(s), the measurement results
  • related_file(s), pointer(s) to related files, e.g. the raw experimental data sets in the vendor-specific, and often proprietary, format
  • datacubes, multi-dimensional data such as chromatograms, images, and plate readings

IDS JSON Example

Each IDS has a type, a version, and a namespace. Every IDS JSON file carries these three keys at the root level: @idsType, @idsVersion, and @idsNamespace.

{
  "@idsType": "cell-counter",
  "@idsVersion": "v1.0.0",
  "@idsNamespace": "common",
  "system": {
    "serial_number": "serial_number"
  },
  "run": {
    "id": "413befdd-c7e2-4edd-9e9b-06cf1cb0283f"
  },
  "time": {
    "measurement": "2015-09-24T03:47:13.0Z"
  },
  "sample": {
    "id": "unknown-10",
    "batch": {
      "id": "batch-number"
    }
  },
  "method": {
    "instrument": {
      "cell_type": "CHO",
      "dilution_factor": 1
    }
  },
  "user": {
    "name": "operator-1"
  },
  "result": {
    "cell": {
      "viability": {
        "value": 0.1,
        "unit": "Percent"
      },
      "diameter": {
        "average": {
          "live": {
            "value": 21.07,
            "unit": "Micrometer"
          }
        }
      },
      "count": {
        "total": {
          "value": 1207,
          "unit": "Cell"
        },
        "viable": {
          "value": 1,
          "unit": "Cell"
        }
      },
      "density": {
        "total": {
          "value": 102.24,
          "unit": "MillionCellsPerMilliliter"
        },
        "viable": {
          "value": 0.1,
          "unit": "MillionCellsPerMilliliter"
        }
      }
    }
  }
}
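
As a minimal sketch of how an application might consume such a file, the Python snippet below loads a trimmed copy of the example, checks that the three root keys are present, and reads one result value. The helper name ids_identity is hypothetical and not part of any TetraScience library.

```python
import json

# A trimmed copy of the cell-counter example above.
ids_text = """
{
  "@idsType": "cell-counter",
  "@idsVersion": "v1.0.0",
  "@idsNamespace": "common",
  "result": {
    "cell": {
      "count": {
        "total": {"value": 1207, "unit": "Cell"}
      }
    }
  }
}
"""

ROOT_KEYS = ("@idsType", "@idsVersion", "@idsNamespace")


def ids_identity(doc):
    """Return (namespace, type, version); raise if a root key is missing.

    Hypothetical helper written for this illustration only.
    """
    missing = [key for key in ROOT_KEYS if key not in doc]
    if missing:
        raise ValueError(f"missing root keys: {missing}")
    return doc["@idsNamespace"], doc["@idsType"], doc["@idsVersion"]


doc = json.loads(ids_text)
identity = ids_identity(doc)

# Because the hierarchy is standardized, a value is reached by a fixed path.
total = doc["result"]["cell"]["count"]["total"]
print(identity, total["value"], total["unit"])
```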

Data Cubes

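In IDS, data cubes are used to store multi-dimensional matrices. Each data cube lists its measures (the values) and its dimensions (the axes that index those values).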

{
  "datacubes": [
    {
      "name": "3D chromatogram",
      "description": "More information about the data cube. (Optional)",
      "another_property": "what ever you would like to put in here",
      "measures": [
        {
          "name": "intensity",
          "unit": "ArbitraryUnit",
          "value": [
            [111, 222, 333, 444, 555],
            [111, 222, 333, 444, 555],
            [111, 222, 333, 444, 555]
          ]
        }
      ],
      "dimensions": [
        {
          "name": "wavelength",
          "unit": "Nanometer",
          "scale": [180, 190, 200]
        },
        {
          "name": "time",
          "unit": "MinuteTime",
          "scale": [1, 2, 3, 4, 5]
        }
      ]
    }
  ]
}

Example        Measure(s)               Dimension(s)
Chromatogram   1. Detector Intensity    1. Wavelength
                                        2. Retention Time
Weather        1. Temperature           1. Longitude
               2. Humidity              2. Latitude
Plate Reader   1. Absorbance            1. Row Position
               2. Concentration         2. Column Position
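
The relationship between measures and dimensions can be sketched in Python. The snippet below assumes, based only on the chromatogram example above, that the i-th dimension indexes the i-th nesting level of each measure's value array (3 wavelengths by 5 time points). The function names are hypothetical.

```python
def nested_shape(values):
    """Return the shape of a nested list, e.g. (3, 5).

    Only the first element at each level is inspected, so the input is
    assumed to be rectangular.
    """
    shape = []
    while isinstance(values, list):
        shape.append(len(values))
        values = values[0] if values else None
    return tuple(shape)


def check_datacube(cube):
    """Check that every measure matches the lengths of the dimension scales."""
    expected = tuple(len(dim["scale"]) for dim in cube["dimensions"])
    for measure in cube["measures"]:
        actual = nested_shape(measure["value"])
        if actual != expected:
            raise ValueError(
                f"measure {measure['name']!r} has shape {actual}, expected {expected}"
            )
    return expected


cube = {
    "name": "3D chromatogram",
    "measures": [
        {
            "name": "intensity",
            "unit": "ArbitraryUnit",
            "value": [
                [111, 222, 333, 444, 555],
                [111, 222, 333, 444, 555],
                [111, 222, 333, 444, 555],
            ],
        }
    ],
    "dimensions": [
        {"name": "wavelength", "unit": "Nanometer", "scale": [180, 190, 200]},
        {"name": "time", "unit": "MinuteTime", "scale": [1, 2, 3, 4, 5]},
    ],
}

shape = check_datacube(cube)

# Look up the intensity at wavelength 190 nm and time 4 min.
w = cube["dimensions"][0]["scale"].index(190)
t = cube["dimensions"][1]["scale"].index(4)
intensity = cube["measures"][0]["value"][w][t]
print(shape, intensity)
```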

Pointer to large data sets

IDS JSON uses the pointer pattern to reference arbitrarily large files.

{
  "result": {
    "images": [
      {
        "id": "1",
        "location": {
          "fileKey": "uuid/b1020 001.tif",
          "version": "version-value",
          "bucket": "datalake",
          "type": "s3file",
          "fileId": "the uuid for each file in ts data lake"
        }
      }
    ]
  }
}
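
A consumer that needs the referenced files first has to locate the pointers. The Python sketch below walks an IDS document and collects them. Treating any object that has both a "fileKey" and a "bucket" key as a pointer is an assumption made for this illustration, and iter_pointers is a hypothetical helper.

```python
def iter_pointers(node, path="$"):
    """Yield (path, pointer) for every file pointer found under node.

    Assumption: an object with both "fileKey" and "bucket" is a pointer.
    """
    if isinstance(node, dict):
        if "fileKey" in node and "bucket" in node:
            yield path, node
            return
        for key, child in node.items():
            yield from iter_pointers(child, f"{path}.{key}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_pointers(child, f"{path}[{index}]")


doc = {
    "result": {
        "images": [
            {
                "id": "1",
                "location": {
                    "fileKey": "uuid/b1020 001.tif",
                    "version": "version-value",
                    "bucket": "datalake",
                    "type": "s3file",
                    "fileId": "the uuid for each file in ts data lake",
                },
            }
        ]
    }
}

pointers = list(iter_pointers(doc))
for path, pointer in pointers:
    print(path, pointer["bucket"], pointer["fileKey"])
```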

How to include description and ontology in IDS?

In IDS, you can include descriptions and ontology terms of your choice.

{
  "type": "object",
  "properties": {
    "@idsType": {
      "type": "string",
      "const": "lc-column"
    },
    "@idsVersion": {
      "type": "string",
      "const": "v1.0.0"
    },
    "@idsNamespace": {
      "type": "string",
      "const": "common"
    },
    "column": {
      "type": "object",
      "properties": {
        "target_temperature": {
          "type": "number",
          "description": "Target temperature of liquid chromatography column during an injection, unit in Degree Celsius",
          "@type": "http://purl.company.org/column#lc-0001",
          "@prefLabel": "Column Target Temperature"
        }
      }
    }
  }
}

As mentioned, IDS is designed in collaboration with instrument manufacturers and Life Sciences companies. The design is also heavily influenced by the Allotrope Foundation.

In the IDS, you can add optional fields such as @type and @id to encode Linked Data. TetraScience has an application called ADF Converter (https://www.tetrascience.com/products/adf-converter) that converts the IDS JSON into an HDF5-based ADF file using the Leaf Node pattern.
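
To illustrate how such annotations might be read back out of a schema, the Python sketch below walks the "properties" of a trimmed copy of the lc-column schema and lists every property that carries a description or an @type. The function annotated_properties is hypothetical, and it only follows nested objects, not arrays.

```python
def annotated_properties(schema, path=""):
    """Yield (path, description, ontology_type) for annotated properties."""
    for name, prop in schema.get("properties", {}).items():
        prop_path = f"{path}.{name}" if path else name
        if "description" in prop or "@type" in prop:
            yield prop_path, prop.get("description"), prop.get("@type")
        if prop.get("type") == "object":
            # Recurse into nested objects to reach deeper properties.
            yield from annotated_properties(prop, prop_path)


schema = {
    "type": "object",
    "properties": {
        "@idsType": {"type": "string", "const": "lc-column"},
        "column": {
            "type": "object",
            "properties": {
                "target_temperature": {
                    "type": "number",
                    "description": "Target temperature of liquid chromatography column during an injection",
                    "@type": "http://purl.company.org/column#lc-0001",
                    "@prefLabel": "Column Target Temperature",
                }
            },
        },
    },
}

annotations = list(annotated_properties(schema))
for prop_path, description, ontology_type in annotations:
    print(prop_path, ontology_type)
```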

SQL Tables

SQL tables are automatically generated based on the IDS. Each array of objects defined in the IDS has an associated SQL table exposed via JDBC and ODBC. For more information, refer to SQL Tables.

The TetraScience SQL interface restricts how tables, columns, and partitions can be named:

  • lowercase
  • no special characters, except underscore (_)

Because IDS property names can include uppercase letters and special characters, the names are sanitized to conform to the table spec. The following transformation rules are applied:

  • convert to lowercase
  • replace all special characters with an underscore
  • replace all spaces with an underscore
  • remove repeated underscores
  • remove all leading underscores

Examples:

IDS Key Name           Table Column Name
@fileId                fileid
_strange__key_@name    strange_key_name
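
The transformation rules above can be sketched in Python as follows. This is an illustration, not the platform's actual implementation: the function name sanitize_name is hypothetical, and "special character" is assumed to mean anything other than an ASCII letter, a digit, or an underscore.

```python
import re


def sanitize_name(ids_key):
    """Apply the listed rules, in order, to an IDS property name."""
    name = ids_key.lower()
    # Assumption: everything outside [a-z0-9_] counts as special;
    # this also covers spaces.
    name = re.sub(r"[^a-z0-9_]", "_", name)
    # Collapse runs of underscores into a single underscore.
    name = re.sub(r"_+", "_", name)
    # Drop any leading underscores.
    return name.lstrip("_")


for key in ("@fileId", "_strange__key_@name"):
    print(key, "->", sanitize_name(key))
```

Both inputs reproduce the column names shown in the examples table.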

View IDS

You can view the available IDSs by clicking the menu button in the upper-left corner of the TetraScience platform, then clicking "Data Schemas".

Go to "Data Schemas" page

Then select a schema. Click a property to expand it if it is an object or an array.

"Data Schemas" page

Versions

TetraScience uses Semantic Versioning for IDS versions. Every schema change produces a new IDS version, along with a new script version and a new protocol version. Your existing pipelines keep using the old versions unless you update them, which ensures data consistency for compliance. To use a new IDS version, edit your pipeline protocols and select the newest version.
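
Because versions follow Semantic Versioning, they should be compared numerically rather than as plain strings. The Python sketch below shows one way to do that for version strings in the "v1.0.0" style used by @idsVersion. The list of available versions is made up for the illustration, and pre-release or build suffixes are not handled.

```python
def parse_ids_version(version):
    """Turn 'v1.10.1' into (1, 10, 1) so versions compare numerically."""
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    return major, minor, patch


# Hypothetical list of published versions for one IDS.
available = ["v1.0.0", "v1.2.0", "v1.10.1", "v2.0.0"]

newest = max(available, key=parse_ids_version)
print(newest)

# Numeric comparison ranks v1.10.1 above v1.2.0; a plain string
# comparison would get this wrong.
print(parse_ids_version("v1.10.1") > parse_ids_version("v1.2.0"))
```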