Intermediate Data Schema (IDS) Overview
The Intermediate Data Schema (IDS), designed by TetraScience in collaboration with instrument manufacturers, scientists, and customers, is a schema that is applied to raw instrument data or report files. It maps vendor-specific information (such as the name of a field) to vendor-agnostic information. The IDS standardizes naming, data type (whether a field is a string, an integer, or a date), data range (for example, a value that must be a positive number), and data hierarchy.
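To make the mapping idea concrete, here is a minimal Python sketch. It is not TetraScience's implementation. The vendor field names (`Samp_Name`, `InjVol(uL)`, `RunDate`), the vendor-agnostic names, and the `harmonize` helper are all hypothetical and chosen only to show naming, data type, and data range being standardized in one pass. Data hierarchy is left out to keep the example short.

```python
from datetime import datetime

# Hypothetical rules: vendor field -> (vendor-agnostic name, converter, range check).
# None of these names come from a real IDS; they only illustrate the idea.
FIELD_RULES = {
    "Samp_Name": ("sample_name", str, None),
    "InjVol(uL)": ("injection_volume_ul", float, lambda v: v > 0),
    "RunDate": ("run_date", lambda s: datetime.strptime(s, "%Y-%m-%d").date(), None),
}


def harmonize(vendor_record):
    """Rename vendor fields, coerce their types, and enforce value ranges."""
    result = {}
    for vendor_name, raw_value in vendor_record.items():
        if vendor_name not in FIELD_RULES:
            continue  # ignore fields this sketch has no rule for
        name, convert, in_range = FIELD_RULES[vendor_name]
        value = convert(raw_value)  # raises ValueError if the text cannot be converted
        if in_range is not None and not in_range(value):
            raise ValueError(f"{name}={value!r} is outside the allowed range")
        result[name] = value
    return result


record = harmonize({"Samp_Name": "A1", "InjVol(uL)": "5.0", "RunDate": "2023-04-01"})
print(record)
```

A record with a non-positive injection volume, such as `{"InjVol(uL)": "-1"}`, raises `ValueError` in this sketch, which mirrors the kind of range rule described above.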
By doing this, the IDS harmonizes different data sets in the life sciences industry, such as instrument data, CRO assay data, and software data. Because IDS-generated JSON files are predictable, consistent, and vendor-agnostic, life sciences companies can consume the data in their applications, build searches and aggregations, and feed the data into visualization and analysis software.
Raw data files are uploaded to the data lake and then parsed by TetraScience data pipelines, which, among other things, generate a JSON file using the IDS designed for that specific instrument as a guide. During processing, the data is validated and search indexes are generated. The resulting file, called an IDS JSON, is a JSON representation of the RAW data. Other files, such as logs and search mappings, are generated as well. The IDS JSON and the other files can be viewed in the Tetra Data Platform (TDP).
Once IDS JSONs are generated, you can query the data using SQL, either from within the TDP or from a tool of your choice, or through an API call. If you have questions as you review your data, refer to the IDS examples and readme files in the TDP to understand the structure and how the data maps to the original RAW file.
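As an illustration of why consistent naming makes SQL queries simple, the sketch below uses Python's built-in `sqlite3` module as a local stand-in for a SQL endpoint. The table `ids_results`, its columns, and its rows are hypothetical. The actual TDP table names, SQL dialect, and connection details are not covered here and should be taken from the TDP itself.

```python
import sqlite3

# In-memory database standing in for a SQL endpoint (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ids_results (sample_name TEXT, instrument TEXT, peak_area REAL)"
)
conn.executemany(
    "INSERT INTO ids_results VALUES (?, ?, ?)",
    [
        ("A1", "vendor-x", 10.5),
        ("A1", "vendor-y", 11.0),
        ("B2", "vendor-x", 7.25),
    ],
)

# Because the column names are the same regardless of vendor,
# a single query can aggregate rows that came from different instruments.
rows = conn.execute(
    "SELECT sample_name, AVG(peak_area) FROM ids_results "
    "GROUP BY sample_name ORDER BY sample_name"
).fetchall()
print(rows)
conn.close()
```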
IDS Terminology
The following table lists terminology related to IDSs.
Term | Description |
---|---|
Artifact | A versioned instance of an IDS, protocol, or task script. Every artifact: • Belongs to a namespace, such as common, client-xx, or private-xx • Has a type or slug, such as lcuv-empower (for an IDS, the type is the IDS type, such as akta) • Has a version, such as v3.1.0 • Has associated files, such as a readme file |
Data Cube | A multidimensional array of values. IDS data is stored in one or more data cubes. |
Intermediate Data Schema (IDS) | A family of TetraScience-developed schemas, each designed for a different life science system. |
JavaScript Object Notation (JSON) | A data-interchange format that is easy for human beings to read and write and for machines to parse and generate. The IDS artifact is a JSON file. |
Structured Query Language (SQL) | The language used to communicate with relational database management systems. In the TDP, you can use SQL to query data tables. |
Schema | A description of the structure of data. The IDS is a schema that is designed to map instrument data to a standard format. |
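The artifact attributes and the data cube definition in the table can be pictured with a short Python sketch. The `Artifact` class, the `identifier` format, and the `cube` layout with its dimension names are hypothetical illustrations, not TDP data structures. Only the example namespace `common`, the slug `lcuv-empower`, and the version number 3.1.0 come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """Hypothetical model of the attributes every artifact is described as having."""
    namespace: str      # e.g. "common", "client-xx", or "private-xx"
    slug: str           # the type, e.g. "lcuv-empower"
    version: str        # e.g. "v3.1.0"
    files: tuple = ()   # associated files, such as a readme

    def identifier(self):
        # Joining with "/" is a convention chosen for this sketch only.
        return f"{self.namespace}/{self.slug}/{self.version}"


ids_artifact = Artifact("common", "lcuv-empower", "v3.1.0", ("README.md",))
print(ids_artifact.identifier())

# A data cube as a multidimensional array: here 2 wavelengths x 3 time points.
# The values are made up for illustration.
cube = {
    "dimensions": {"wavelength_nm": [254, 280], "time_min": [0.0, 0.5, 1.0]},
    "measures": [[0.01, 0.12, 0.03], [0.02, 0.20, 0.04]],
}
# Look up the measure at wavelength 280 nm (index 1) and time 0.5 min (index 1).
value = cube["measures"][1][1]
print(value)
```

Keeping the dimension values alongside the measures means each number in the array can be traced back to the coordinates it was recorded at.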