Tetra Data Lake is file-based: every entity is a file. There are three categories of files stored in the Data Lake: raw, IDS, and processed.

Raw File

  • Unprocessed files, such as those generated by instruments or uploaded to Egnyte by CROs.
  • Raw files are searchable only by filename and file metadata, such as creation time; they cannot be searched by their content. An example query is sketched below.
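As a sketch, a search for raw files by filename using the Elasticsearch query DSL might look like the following. The category and file.path fields come from the file schema described later in this section; the filename value and the use of a wildcard query are illustrative assumptions:

const rawFileQuery = {
  query: {
    bool: {
      filter: [
        // "category" is case sensitive, as noted in the schema below
        { term: { category: 'RAW' } },
      ],
      must: [
        // match on the stored file path; the filename is made up
        { wildcard: { 'file.path': '*hplc_run_42.csv' } },
      ],
    },
  },
};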

IDS File

For background on what an IDS is, see "What is TetraScience IDS?". You can view the schemas on the TetraScience platform.

You can list or find schemas by using the List schemas API, which provides the following information for each IDS:

  • The associated JSON schema
  • The associated Elasticsearch mapping, as well as the fields that are not indexed in Elasticsearch. If a property is listed in nonSearchableFields, it will NOT be returned by the Search files or Search files via Elasticsearch query APIs, but you can still get it back via the Retrieve a file API (see the sketch after this list).
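For illustration, calling the List schemas API from JavaScript might look like the sketch below. The endpoint URL, authentication header, and response shape are assumptions rather than the documented contract; consult the API reference for the exact details:

// Hypothetical endpoint and auth header, for illustration only
const response = await fetch('https://api.example-tetrascience-host.com/v1/ids/schemas', {
  headers: { authorization: 'Bearer <token>' },
});
const schemas = await response.json();
for (const ids of schemas) {
  // Each entry is expected to include the JSON schema, the Elasticsearch
  // mapping, and the nonSearchableFields list described above
  console.log(ids.idsType, ids.nonSearchableFields);
}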

Elasticsearch File Schema

Every file, regardless of its category, is indexed in Elasticsearch. The full schema for each file entity within the Data Lake looks like the following:

  • fileId: (uuid) required for all file categories
  • traceId: (uuid) required for all file categories
  • rawFileId: (uuid) PROCESSED or IDS files only
  • orgSlug: (string) required for all file categories
  • category: (string) RAW, IDS, PROCESSED (case sensitive)
  • idsType: (string) IDS files only; the type of the IDS. Every parser schema has its own type.
  • idsTypeVersion: (string) IDS files only; the version of the IDS.
  • createdAt: (timestamp) Unix time
  • outdated: (bool) true|false
  • source: (object)
    • id: (uuid) sourceId
    • name: (string) source name
    • type: (string) source type
    • egnyte: (object) for files imported via Egnyte integration only
      • integrationId: (uuid)
      • fileName: (string)
      • filePath: (string)
      • groupId: (string) files with the same groupId are different versions of the same file; each of them has a different entryId or versionId
      • entryId: (string) the Egnyte file's versionId
      • url: (string)
      • versionId: (string)
      • versionNum: (string)
      • lastModifiedTime: (string)
      • checksum: (string)
    • datapipeline: (object) for files created inside datapipeline only
      • integrationId: (uuid)
      • workflowId: (uuid)
      • pipelineId: (uuid)
      • taskExecutionId: (uuid)
      • inputFileId: (uuid)
      • masterScript: (string)
      • taskSlug: (string)
      • taskScript: (string)
    • box: (object) for files imported from Box only
      • integrationId: (uuid)
      • fileId: (uuid)
    • datahub: (object) for files created by datahub only
      • integrationId: (uuid)
    • dotmatics: (object) for files created by Dotmatics only
      • integrationId: (uuid)
    • email: (object) for files created from email integrations
      • integrationId: (uuid)
      • from: (string) the sender's email address
      • to: (string) the recipient's email address
      • host: (string) the email server set up in the email integration
      • hostPort: (string) the email server port set up in the email integration
      • userName: (string) the account that the email integration uses to access emails in the inbox
  • file: (object)
    • path: (string)
    • bucket: (string)
    • version: (string)
    • size: (int)
    • checksum: (string)
    • type: (string) e.g. "json"
  • data: (object) IDS files only; the content of the IDS file. It follows the Intermediate Data Schema (IDS). Be aware that the nonSearchableFields will not be returned here, since they are not indexed by Elasticsearch.
  • tags: (array of string) tags added to the files
  • metadata: (object) key value pairs
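Putting the schema together, an indexed IDS file entity might look like the following sketch. Only the field names come from the schema above; every value is a made-up placeholder:

const exampleFileEntity = {
  fileId: '7c9e6679-7425-40de-944b-e07fc1f90ae7', // all UUIDs here are illustrative
  traceId: '550e8400-e29b-41d4-a716-446655440000',
  rawFileId: '9b2d7a10-1c3f-4f6e-8a2b-3d4e5f6a7b8c',
  orgSlug: 'my-org',
  category: 'IDS',
  idsType: 'example-instrument', // made-up IDS type
  idsTypeVersion: 'v1.0.0',
  createdAt: 1546300800,
  outdated: false,
  source: {
    id: 'c56a4180-65aa-42ec-a945-5fd21dec0538',
    name: 'lab-datahub-01',
    type: 'datahub',
    datahub: {
      integrationId: '110e8400-e29b-41d4-a716-446655440001',
    },
  },
  file: {
    path: 'my-org/example/result.json',
    bucket: 'example-datalake-bucket',
    version: '<s3-version-id>',
    size: 2048,
    checksum: '<checksum>',
    type: 'json',
  },
  data: {
    // IDS JSON content, minus any nonSearchableFields
  },
  tags: ['qc'],
  metadata: { project: 'example-project' },
};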

Elasticsearch Mapping Rules

For IDS files, since the content of the file varies by data source, we enforce the following dynamic mapping rules to achieve flexible and consistent data indexing (to read more, see the Elasticsearch documentation on dynamic mapping):

const mapping = {
  dynamic_templates: [
    {
      // This dynamic mapping rule maps every "string" to "keyword"
      // and creates a subfield of type "text"
      string2keyword: {
        match_mapping_type: 'string',
        mapping: {
          type: 'keyword',
          fields: {
            text: {
              type: 'text',
            },
          },
        },
      },
    },
    {
      // Elasticsearch automatically detects integers and maps them to type "long";
      // non-integers are automatically mapped to type "double".
      // This dynamic mapping rule maps all numerical values to "double"
      long2double: {
        match_mapping_type: 'long',
        mapping: {
          type: 'double',
        },
      },
    },
    {
      // For custom fields, try to map the value to "double".
      // If the value is a string, ignore the malformed value for the
      // "double" mapping and index it as keyword and text in the subfields
      customField: {
        path_match: '*.custom_field.*',
        mapping: {
          type: 'double',
          ignore_malformed: true,
          fields: {
            keyword: {
              type: 'keyword',
            },
            text: {
              type: 'text',
            },
          },
        },
      },
    },
  ],
}
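The practical effect of these rules is that string values can be matched exactly via the keyword mapping or full-text via the generated text subfield, and numeric values can be queried with range filters. A query sketch, with made-up field names under data:

const idsQuery = {
  query: {
    bool: {
      filter: [
        // exact match against the keyword mapping of a string field
        { term: { 'data.sample.name': 'QC-01' } },
        // numeric range against the "double" mapping
        { range: { 'data.custom_field.temperature': { gte: 20, lte: 25 } } },
      ],
      must: [
        // full-text match against the generated "text" subfield
        { match: { 'data.sample.name.text': 'qc' } },
      ],
    },
  },
};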

Allotrope Data Format

The importance of data standardization has been recognized by the community, and the Allotrope Foundation was founded by enterprise pharmaceutical companies to revolutionize the way we acquire, share, and gain insights from scientific data, through a community and a framework built on standardization, linked data, and HDF5.

However, modeling the data coming from various instrument vendors in an ADF file is challenging and can be time-consuming, especially the process of creating a semantically accurate Data Description, which contains not only the data itself but also its semantic meanings and relationships.

Until the Allotrope Data Model (ADM) and its tooling are ready, we will not be able to create Data Descriptions in production, due both to the lack of consensus around the shape of the graph contained within them and to the lack of tooling to make ADF generation efficient enough for production.

In response to this challenge, multiple Allotrope member companies and Allotrope Partner Network (APN) companies, including TetraScience, have adopted a process and strategy for executing early Allotrope-related projects, in order to fulfill user requirements and realize immediate business value.

The key concepts that TetraScience has adopted are Pre-ADF and IDS.

IDS stands for “Intermediate Data Schema” and is a JSON schema. We use it to define the structure of a human- and machine-readable JSON file, which we call “IDS JSON”.

This JSON file includes all the scientific data and metadata that can be meaningfully captured from the instruments via the Tetra Data Platform.

The IDS JSON is meant to capture every piece of information from the data source that the final ADF file could possibly need, while also being easier to view, search, and integrate with web applications as the Allotrope Data Model (ADM) matures. IDS is also data-source specific: there is an IDS for each major class of scientific data, and the classification typically maps to the instrument classification, since different instruments produce different kinds of data. A hypothetical example is sketched below.
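To make the idea concrete, an IDS JSON for a hypothetical chromatography instrument class might be shaped like the following. The structure and field names are purely illustrative; each real IDS defines its own schema:

const idsJsonExample = {
  // shape and field names are illustrative only
  instrument: {
    vendor: 'ExampleVendor',
    model: 'X-100',
    serial_number: 'SN-001',
  },
  sample: {
    name: 'QC-01',
    batch: 'B-42',
  },
  results: [
    {
      peak_name: 'caffeine',
      retention_time: { value: 2.13, unit: 'min' },
      area: { value: 1523.7, unit: 'mAU*s' },
    },
  ],
};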

A pre-ADF file is an Allotrope file created by the official ADF API, containing an accurate Data Cube (there is typically much less debate around what should go into the Data Cube: usually temporal data, chromatograms, and the like), an accurate Data Package that includes the IDS JSON and/or any relevant raw files, and an empty Data Description (some triples will still be generated by the Allotrope API while interfacing with the Data Cube and Data Package, but we do not actively create triples and insert them into the Data Description).

This approach embeds the IDS JSON inside the pre-ADF file as part of the Data Package, such that the pre-ADF itself contains all the information needed to be transformed into an official ADF file once the ADM matures. The TetraScience Data Lake, a component of the Tetra Data Platform, leverages IDS JSON as its native data format to give early adopters of Allotrope the ability to search scientific data, answer important business questions, and integrate instrument data with applications such as ELNs and visualization software while waiting for AFO/ADM to be released and mature.

This particular strategy also ensures that we have a viable approach to transforming pre-ADF into ADF.
TetraScience will leverage its partnerships with instrument manufacturers, internal science/instrument SMEs, active participation in Allotrope and its working groups, and technologies such as JSON-LD to create a pre-ADF-to-AFO mapping, anticipating the right ontology and semantics to use during the future pre-ADF-to-ADF transformation.