Intermediate Data Schema (IDS)
Harmonize Your Experimental Data
Introduction
IDS is short for "Intermediate Data Schema". It is designed by TetraScience in collaboration with instrument manufacturers, scientists, and informatics teams from Life Sciences companies to harmonize different data sets in the Life Sciences industry, such as instrument data, CRO assay data, and software data.
The motivation behind extracting and transforming data from various sources into a predictable, consistent, and vendor-agnostic schema is to allow Life Sciences companies to consume the data in their applications, build searches and aggregations, and feed the data into visualization and analysis software seamlessly.
IDS is based on JSON Schema, and IDS files are JSON files that are typically generated from raw files (such as output from an instrument or a report generated by your CRO). The IDS standardizes naming, data type (whether the field is a string, an integer, or a date), data range (for example, a value that must be a positive number), and data hierarchy.
All raw data collected from instruments is parsed by scripts developed by TetraScience, following the IDS designed for the specific instrument. Once a raw file is parsed, the output is called an IDS file. With IDS, data in various formats is standardized into one format: JSON. Together with the schema viewer, users can easily find the content stored in each IDS file and can start searching and querying data by referencing the IDS.
IDS Design
Inside IDS, we attempt to capture as much information as possible, such as:
- time, the time that the data set is related to
- system(s), the equipment used to produce the result, including its software and firmware
- user(s), the person who performed the experiment
- sample(s), the sample used in the experiment
- method(s), the experimental recipe, usually including input parameters
- run(s), a particular execution of the experiment
- experiment(s), information about the experiment, such as its ID and name
- result(s), the measurement results
- related_file(s), pointer(s) to related files, e.g. the raw experimental data sets in the vendor-specific, and often proprietary, format
- datacubes, multi-dimensional data such as chromatograms, images, and plate readings
IDS JSON Example
Each IDS has a type, a version, and a namespace. Every IDS JSON carries these three keys at the root level: @idsType, @idsVersion, and @idsNamespace.
{
"@idsType": "cell-counter",
"@idsVersion": "v1.0.0",
"@idsNamespace": "common",
"system": {
"serial_number": "serial_number"
},
"run": {
"id": "413befdd-c7e2-4edd-9e9b-06cf1cb0283f"
},
"time": {
"measurement": "2015-09-24T03:47:13.0Z"
},
"sample": {
"id": "unknown-10",
"batch": {
"id": "batch-number"
}
},
"method": {
"instrument": {
"cell_type": "CHO",
"dilution_factor": 1
}
},
"user": {
"name": "operator-1"
},
"result": {
"cell": {
"viability": {
"value": 0.1,
"unit": "Percent"
},
"diameter": {
"average": {
"live": {
"value": 21.07,
"unit": "Micrometer"
}
}
},
"count": {
"total": {
"value": 1207,
"unit": "Cell"
},
"viable": {
"value": 1,
"unit": "Cell"
}
},
"density": {
"total": {
"value": 102.24,
"unit": "MillionCellsPerMilliliter"
},
"viable": {
"value": 0.1,
"unit": "MillionCellsPerMilliliter"
}
}
}
}
}
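As a minimal sketch of how a consumer might read such a file, the Python snippet below parses an IDS JSON document, checks that the three root-level keys are present, and reads one nested result. The function name `check_ids_root` and the trimmed-down inline document are illustrative assumptions and are not part of any TetraScience library.

```python
import json

# The three root-level keys that every IDS JSON carries, per the text above.
REQUIRED_ROOT_KEYS = ("@idsType", "@idsVersion", "@idsNamespace")


def check_ids_root(ids_doc: dict) -> list:
    """Return the required root-level keys that are missing (hypothetical helper)."""
    return [key for key in REQUIRED_ROOT_KEYS if key not in ids_doc]


# Trimmed copy of the cell-counter example, inlined so the sketch is self-contained.
raw = """
{
  "@idsType": "cell-counter",
  "@idsVersion": "v1.0.0",
  "@idsNamespace": "common",
  "result": {"cell": {"count": {"total": {"value": 1207, "unit": "Cell"}}}}
}
"""

ids_doc = json.loads(raw)
missing = check_ids_root(ids_doc)

# Values are reached by walking the hierarchy defined by the IDS.
total = ids_doc["result"]["cell"]["count"]["total"]
print(missing, total["value"], total["unit"])
```

Because the hierarchy is fixed by the IDS, the same access path should work for any file of the same type and version.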
Data Cubes
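In IDS, datacubes store multi-dimensional matrices. Each datacube lists its measures (the values) and its dimensions (the axes along which the values are indexed).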
{
"datacubes": [
{
"name": "3D chromatogram",
"description": "More information about the data cube. (Optional)",
"another_property": "what ever you would like to put in here",
"measures": [
{
"name": "intensity",
"unit": "ArbitraryUnit",
"value": [
[111, 222, 333, 444, 555],
[111, 222, 333, 444, 555],
[111, 222, 333, 444, 555]
]
}
],
"dimensions": [
{
"name": "wavelength",
"unit": "Nanometer",
"scale": [180, 190, 200]
},
{
"name": "time",
"unit": "MinuteTime",
"scale": [1, 2, 3, 4, 5]
}
]
}
]
}
Example | Measure(s) | Dimension(s) |
---|---|---|
Chromatogram | 1. Detector Intensity | 1. Wavelength 2. Retention Time |
Weather | 1. Temperature 2. Humidity | 1. Longitude 2. Latitude |
Plate Reader | 1. Absorbance 2. Concentration | 1. Row Position 2. Column Position |
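To illustrate how measures and dimensions relate, here is a small Python sketch that looks up a measure value by its dimension coordinates. It assumes that the nesting order of a measure's `value` array follows the order of the `dimensions` list, which matches the chromatogram example above (3 wavelengths, each with 5 time points). The `lookup` helper is hypothetical.

```python
def lookup(datacube: dict, measure_name: str, **coords):
    """Return the measure value at the given dimension coordinates.

    Assumption: the nesting order of `value` follows the order of
    `dimensions` (outermost list = first dimension).
    """
    measure = next(m for m in datacube["measures"] if m["name"] == measure_name)
    cell = measure["value"]
    for dim in datacube["dimensions"]:
        # Translate a coordinate on the scale into a positional index.
        index = dim["scale"].index(coords[dim["name"]])
        cell = cell[index]
    return cell


# The chromatogram datacube from the example above.
cube = {
    "name": "3D chromatogram",
    "measures": [
        {
            "name": "intensity",
            "unit": "ArbitraryUnit",
            "value": [
                [111, 222, 333, 444, 555],
                [111, 222, 333, 444, 555],
                [111, 222, 333, 444, 555],
            ],
        }
    ],
    "dimensions": [
        {"name": "wavelength", "unit": "Nanometer", "scale": [180, 190, 200]},
        {"name": "time", "unit": "MinuteTime", "scale": [1, 2, 3, 4, 5]},
    ],
}

# Intensity at 190 nm and minute 3: second row, third column.
print(lookup(cube, "intensity", wavelength=190, time=3))  # prints 333
```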
Pointer to large data sets
IDS JSON uses the pointer pattern to reference arbitrarily large files.
{
"result": {
"images": [{
"id": "1",
"location": {
"fileKey": "uuid/b1020 001.tif",
"version": "version-value",
"bucket": "datalake",
"type": "s3file",
"fileId": "the uuid for each file in ts data lake"
}
}]
}
}
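A consumer usually needs to gather these pointers before fetching the large files themselves. The sketch below walks an IDS document and collects every object that looks like a file pointer. Treating any object with both a `fileKey` and a `bucket` as a pointer is an assumption based on the example above, and `find_file_pointers` is a hypothetical helper. Downloading the file is left out.

```python
def find_file_pointers(node):
    """Yield every dict that has both "fileKey" and "bucket", at any depth.

    Assumption: these two keys are enough to recognize a file pointer.
    """
    if isinstance(node, dict):
        if "fileKey" in node and "bucket" in node:
            yield node
        for value in node.values():
            yield from find_file_pointers(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_file_pointers(item)


# The pointer example from above.
ids_doc = {
    "result": {
        "images": [
            {
                "id": "1",
                "location": {
                    "fileKey": "uuid/b1020 001.tif",
                    "version": "version-value",
                    "bucket": "datalake",
                    "type": "s3file",
                    "fileId": "the uuid for each file in ts data lake",
                },
            }
        ]
    }
}

pointers = list(find_file_pointers(ids_doc))
for pointer in pointers:
    print(pointer["bucket"], pointer["fileKey"], pointer["version"])
```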
How to include description and ontology in IDS?
In the IDS, you can include a description and ontology terms of your choice for each field.
{
"type": "object",
"properties": {
"@idsType": {
"type": "string",
"const": "lc-column"
},
"@idsVersion": {
"type": "string",
"const": "v1.0.0"
},
"@idsNamespace": {
"type": "string",
"const": "common"
},
"column": {
"type": "object",
"properties": {
"target_temperature": {
"type": "number",
"description": "Target temperature of liquid chromatography column during an injection, unit in Degree Celsius",
"@type": "http://purl.company.org/column#lc-0001",
"@prefLabel": "Column Target Temperature"
}
}
}
}
}
As mentioned, IDS is designed in collaboration with instrument manufacturers and Life Sciences companies. The design is also heavily influenced by the Allotrope Foundation.
In the IDS, you can add optional fields such as @type and @id to encode Linked Data. TetraScience has an application called [ADF Converter](https://www.tetrascience.com/products/adf-converter) that converts the IDS JSON into an HDF5-based ADF file using the Leaf Node pattern.
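Since these annotations live in the schema itself, they can be read programmatically. The following sketch walks a schema like the lc-column example above and lists each annotated property with its description, `@type`, and `@prefLabel`. The function `collect_annotations` is a hypothetical helper, and it only descends into nested `properties` of objects (arrays are not handled).

```python
def collect_annotations(schema: dict, path: str = ""):
    """Yield (path, description, @type, @prefLabel) for annotated properties."""
    for name, prop in schema.get("properties", {}).items():
        prop_path = f"{path}.{name}" if path else name
        if "description" in prop or "@type" in prop:
            yield (
                prop_path,
                prop.get("description"),
                prop.get("@type"),
                prop.get("@prefLabel"),
            )
        # Recurse into nested objects only; arrays are out of scope here.
        if prop.get("type") == "object":
            yield from collect_annotations(prop, prop_path)


# Trimmed copy of the lc-column schema from above.
schema = {
    "type": "object",
    "properties": {
        "@idsType": {"type": "string", "const": "lc-column"},
        "column": {
            "type": "object",
            "properties": {
                "target_temperature": {
                    "type": "number",
                    "description": "Target temperature of liquid chromatography column",
                    "@type": "http://purl.company.org/column#lc-0001",
                    "@prefLabel": "Column Target Temperature",
                }
            },
        },
    },
}

annotations = list(collect_annotations(schema))
for path, description, ontology_type, label in annotations:
    print(path, "|", label, "|", ontology_type)
```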
SQL Tables
SQL tables are automatically generated based on the IDS. Each array of objects defined in the IDS has an associated SQL table exposed via JDBC and ODBC. For more information, refer to SQL Tables.
The TetraScience SQL interface restricts how tables, columns, and partitions can be named:
- lowercase only
- no special characters, except underscore (_)
Because IDS property names can include uppercase letters and special characters, the names are sanitized to conform to the table spec. The following transformation rules are applied:
- convert to lowercase
- replace all special characters with an underscore
- replace all spaces with an underscore
- remove repeated underscores
- remove all leading underscores
Examples:
IDS Key Name | Table Column Name |
---|---|
@fileId | fileid |
_strange__key_@name | strange_key_name |
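The transformation rules can be sketched in Python as follows. This is an illustration, not the platform's actual implementation: `sanitize_name` is a hypothetical function, it applies the rules in the order listed, and it assumes that only ASCII letters, digits, and underscores count as non-special characters.

```python
import re


def sanitize_name(ids_key: str) -> str:
    """Turn an IDS key into a SQL-safe name (illustrative sketch)."""
    name = ids_key.lower()
    # Spaces and special characters both become underscores.
    # Assumption: only a-z, 0-9 and "_" are considered non-special.
    name = re.sub(r"[^a-z0-9_]", "_", name)
    # Collapse runs of underscores into a single one.
    name = re.sub(r"_+", "_", name)
    # Drop any leading underscores.
    return name.lstrip("_")


print(sanitize_name("@fileId"))              # prints fileid
print(sanitize_name("_strange__key_@name"))  # prints strange_key_name
```

Both outputs match the table above. Note that the order matters: stripping leading underscores before replacing special characters would leave `@fileId` as `_fileid`.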
View IDS
You can view the available IDSs by clicking the menu button in the upper-left corner of the TetraScience platform and then clicking "Data Schemas".
Then select a schema. Click a property to expand it if it is an object or an array.
Versions
TetraScience uses Semantic Versioning for IDS versions. Every schema change results in a new IDS version, along with a new script version and protocol version. Your existing pipelines keep using the old versions unless you update them, which ensures data consistency for compliance. To use a new IDS version, edit your pipeline protocol and select the newest version.
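When comparing IDS versions in your own tooling, compare them numerically rather than as strings. The sketch below parses a version in the `v1.0.0` form shown in the examples above into a tuple of integers. `parse_ids_version` is a hypothetical helper, and it assumes plain MAJOR.MINOR.PATCH versions with no pre-release or build suffix.

```python
def parse_ids_version(version: str) -> tuple:
    """Parse "v1.0.0" into (1, 0, 0).

    Assumption: MAJOR.MINOR.PATCH only, with an optional leading "v".
    """
    if version.startswith("v"):
        version = version[1:]
    major, minor, patch = version.split(".")
    return int(major), int(minor), int(patch)


versions = ["v1.2.0", "v1.10.0", "v1.9.3"]

# String comparison would rank "v1.9.3" highest; numeric comparison does not.
newest = max(versions, key=parse_ids_version)
print(newest)  # prints v1.10.0
```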