The Intermediate Data Schema (IDS), designed by TetraScience in collaboration with instrument manufacturers, scientists, and informatics teams from Life Sciences companies, is a schema that is applied to raw instrument data or generated report files. The schema maps vendor-specific information (such as the name of a field) to vendor-agnostic information. The IDS standardizes naming, data type (whether a field is a string, an integer, or a date), data range (for example, that a value must be a positive number), and data hierarchy.
By doing this, the IDS harmonizes different data sets in the Life Sciences industry, such as instrument data, CRO assay data, and software data. Because the JSON files generated using the IDS are predictable, consistent, and vendor-agnostic, Life Sciences companies can consume the data in their applications, build searches and aggregations, and feed the data into visualization and analysis software.
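The mapping described above can be illustrated with a minimal Python sketch. It is not TetraScience code: the vendor names, field names, and the `harmonize` helper are all hypothetical, chosen only to show how renaming, type coercion, and a range rule turn two differently shaped records into one common shape.

```python
from datetime import datetime

# Hypothetical vendor-specific records: two instruments reporting the same
# measurement under different field names and value formats.
vendor_a_record = {"SampleID": "S-001", "Abs@260": "0.512", "RunDate": "2023-04-01"}
vendor_b_record = {"sample": "S-002", "absorbance_260nm": 0.498, "date": "2023-04-02"}

# Hypothetical mappings from each vendor's field names to shared names.
FIELD_MAPS = {
    "vendor_a": {"SampleID": "sample_id", "Abs@260": "absorbance", "RunDate": "measured_on"},
    "vendor_b": {"sample": "sample_id", "absorbance_260nm": "absorbance", "date": "measured_on"},
}

# Target type (or converter) for each harmonized field.
FIELD_TYPES = {
    "sample_id": str,
    "absorbance": float,
    "measured_on": lambda v: datetime.strptime(v, "%Y-%m-%d").date().isoformat(),
}


def harmonize(record, vendor):
    """Rename fields, coerce types, and enforce a simple range rule."""
    result = {}
    for vendor_name, value in record.items():
        common_name = FIELD_MAPS[vendor][vendor_name]
        result[common_name] = FIELD_TYPES[common_name](value)
    # Illustrative data range rule: absorbance must not be negative.
    if result["absorbance"] < 0:
        raise ValueError("absorbance must be non-negative")
    return result


harmonized = [
    harmonize(vendor_a_record, "vendor_a"),
    harmonize(vendor_b_record, "vendor_b"),
]
```

After harmonization, both records expose the same keys with the same types, so downstream code no longer needs to know which vendor produced a given record.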
Raw data files (called "artifacts") are uploaded to the data lake and then parsed by TetraScience code (data pipelines), which, among other things, generates a JSON file using the IDS designed for that specific instrument as a guide. During processing, the data is validated and search indexes are generated. The newly generated JSON file is an IDS artifact: a JSON representation of the RAW data. Other files, such as logs and search mappings, are generated as well. The IDS file and the other files can be viewed in the Tetra Data Platform (TDP).
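As a rough picture of the parse, validate, and serialize steps described above, consider the following sketch. It assumes a hypothetical two-column CSV export and a deliberately simplified stand-in for a schema (`RESULT_FIELDS`); real pipelines and real IDS definitions are considerably richer, and none of the names here come from the TDP.

```python
import csv
import io
import json

# Hypothetical raw instrument export (the RAW artifact).
RAW_TEXT = """Well,Reading
A1,0.12
A2,0.34
"""

# Hypothetical, heavily simplified stand-in for a schema: the expected
# type of each field in a result entry.
RESULT_FIELDS = {"well": str, "value": float}


def parse_raw(text):
    """Parse the raw export into a nested, JSON-serializable document."""
    rows = csv.DictReader(io.StringIO(text))
    return {
        "results": [
            {"well": row["Well"], "value": float(row["Reading"])} for row in rows
        ]
    }


def validate(document):
    """Return a list of problems; an empty list means the document passed."""
    problems = []
    for index, entry in enumerate(document.get("results", [])):
        for field, expected_type in RESULT_FIELDS.items():
            if field not in entry:
                problems.append(f"results[{index}]: missing '{field}'")
            elif not isinstance(entry[field], expected_type):
                problems.append(f"results[{index}].{field}: wrong type")
    return problems


document = parse_raw(RAW_TEXT)
problems = validate(document)
ids_json = json.dumps(document, indent=2)  # the JSON representation
```

Validating before serializing means a malformed raw file is reported as a list of problems instead of silently producing an inconsistent JSON document.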
Once IDS artifacts are generated, you can query the data with SQL, either from within the TDP or from a tool of your choice, or with an API call from a tool of your choice. If you have questions as you review your data, refer to the IDS examples and readme files in the TDP to see the structure and better understand how the data maps to the original RAW file.
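To give a feel for querying harmonized data with SQL, the sketch below uses an in-memory SQLite database as a stand-in. This is an assumption made purely for illustration: the `results` table, its columns, and the sample rows are hypothetical and do not reflect the actual tables or SQL interface in the TDP.

```python
import sqlite3

# In-memory SQLite database standing in for a SQL endpoint (assumption).
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE results (sample_id TEXT, instrument TEXT, absorbance REAL)"
)
connection.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("S-001", "vendor_a", 0.50),
        ("S-002", "vendor_b", 0.25),
        ("S-003", "vendor_a", 0.75),
    ],
)

# Because every row uses the same column names, one query can aggregate
# across instruments from different vendors.
rows = connection.execute(
    """
    SELECT instrument, COUNT(*) AS n, AVG(absorbance) AS mean_absorbance
    FROM results
    GROUP BY instrument
    ORDER BY instrument
    """
).fetchall()
connection.close()
```

The point of the example is the single `GROUP BY` query: with consistent column names, no per-vendor branching is needed to compare instruments.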
The following table lists terminology related to this topic.
Artifact
A digital item such as a file. IDS is one of the three artifact types in the TDP. Artifacts have associated files, such as a readme file.
Data cube
A multi-dimensional array of values. IDS data is stored in one or more data cubes.
Intermediate Data Schema (IDS)
A family of TetraScience-developed schemas, each designed for a different life sciences system.
JavaScript Object Notation (JSON)
A data-interchange format that is easy for humans to read and write and for machines to parse and generate. The IDS artifact is a JSON file.
Structured Query Language (SQL)
The standard language for querying and managing data in relational database management systems. In the TDP, you can use SQL to query data tables.
Schema
Describes the structure of the data. The IDS is a schema designed to map instrument data to a standard format.
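The data cube entry in the table above can be pictured with a small sketch. The layout below (a `dimensions` list with a `scale` per axis, plus a nested `values` array) and the `lookup` helper are hypothetical, intended only to show what "a multi-dimensional array of values" means in practice; consult the IDS examples in the TDP for the actual structure.

```python
# Hypothetical 2-D data cube: absorbance measured over wavelength and time.
cube = {
    "dimensions": [
        {"name": "wavelength_nm", "scale": [260, 280]},
        {"name": "time_s", "scale": [0, 30, 60]},
    ],
    # values[i][j] is the reading at wavelength i and time j.
    "values": [
        [0.10, 0.12, 0.15],
        [0.05, 0.06, 0.08],
    ],
}


def lookup(cube, wavelength_nm, time_s):
    """Return the value at the given coordinates along each dimension."""
    i = cube["dimensions"][0]["scale"].index(wavelength_nm)
    j = cube["dimensions"][1]["scale"].index(time_s)
    return cube["values"][i][j]


reading = lookup(cube, 280, 30)
```

Each dimension's scale labels one axis of the array, so a value is addressed by one coordinate per dimension.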