Data Lake

The Tetra Data Lake is file based: every entity is a file. Files fall into three categories: RAW, IDS, and PROCESSED.

RAW

RAW files are unprocessed files, such as those generated by instruments or uploaded to Egnyte by contract research organizations (CROs).

You can search for RAW files by filename, file metadata (such as creation time), and attributes (metadata, tags, and labels).

IDS

IDS files are schemas that are applied to raw instrument data or report files. These schemas are used to map vendor-specific information (like the name of a field) to vendor-agnostic information. To learn more, see Intermediate Data Schema (IDS) Overview.

To search for an IDS and view its visualization and associated artifacts, see View Artifact Information.

You can also list or find schemas by using the List schemas API, which returns the following information for each IDS:

  • The associated JSON schema
  • The associated Elasticsearch mapping, along with the fields that are not indexed in Elasticsearch (nonSearchableFields)

📘

NOTE

Properties listed in nonSearchableFields are not returned by the Search Files via Elasticsearch Query Language API. To get those properties, use the Retrieve a File API.

Processed

Processed files are derived from a RAW file and generated by a pipeline. For example, a processed file could be one of the many smaller files extracted from a .zip file, or an image thumbnail generated from a large microscopy image.

Elasticsearch File Schema

Every file is indexed in Elasticsearch, regardless of its category.

To retrieve a file and use its data, use the Retrieve a File API. For more complex queries, use the Search Files via Elasticsearch Query Language API, which searches against the following full schema indexed by Elasticsearch:

  • fileId: (uuid) required for all file categories
  • traceId: (uuid) required for all file categories
  • rawFileId: (uuid) PROCESSED or IDS files only
  • orgSlug: (string) required for all file categories
  • category: (string) RAW, IDS, PROCESSED (case sensitive)
  • idsType: (string) IDS files only; the type of the IDS. Every parser schema has its own type.
  • idsTypeVersion: (string)
  • createdAt: (timestamp) Unix time
  • outdated: (bool) true|false
  • source: (object)
    • id: (uuid) sourceId
    • name: (string) source name
    • type: (string) source type
    • datapipeline: (object) for files created inside data pipeline only
      • integrationId: (uuid)
      • workflowId: (uuid)
      • pipelineId: (uuid)
      • taskExecutionId: (uuid)
      • inputFileId: (uuid)
      • masterScript: (string)
      • taskSlug: (string)
      • taskScript: (string)
  • file: (object)
    • path: (string)
    • bucket: (string)
    • version: (string)
    • size: (int)
    • rawMD5checksum: (string) (Note: For files processed by FLA v4.3 and higher)
    • type: (string) e.g. "json"
    • osCreatedTime: (string)
    • osLastModifiedTime: (string)
    • osFilePath: (string) (Note: For files processed by FLA v4.3 and higher)
    • osFolderPath: (string) (Note: For files processed by FLA v4.3 and higher)
    • osSizeOnDisk: (int)
    • osCreatedUser: (string) (Note: For files processed by FLA v4.3 and higher)
  • data: (object) IDS files only; the content of the IDS file, which follows the Intermediate Data Schema (IDS). Be aware that the nonSearchableFields are not returned here, because they are not indexed by Elasticsearch.
  • tags: (array of string) tags added to the files
  • metadata: (object) key value pairs
  • labels: (array of object) label objects
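
As an illustration of how the top-level fields in this schema can be combined, the following minimal sketch builds a standard Elasticsearch bool/filter query body. It only constructs the object and sends nothing over the network. The helper name buildFileQuery, the IDS type value, and the tag value are hypothetical, and the sketch assumes that category, idsType, and tags are indexed as exact-match (keyword) fields and that the search API accepts a standard Elasticsearch query body.

```javascript
// Sketch only: buildFileQuery is a hypothetical helper, not part of any SDK.
// It assembles an Elasticsearch bool/filter query on top-level schema fields.
function buildFileQuery({ category, idsType, tags = [] }) {
  const filter = [];
  // category is case sensitive: RAW, IDS, or PROCESSED
  if (category) filter.push({ term: { category } });
  // idsType applies to IDS files only
  if (idsType) filter.push({ term: { idsType } });
  // One term clause per tag, so a file must carry every requested tag
  for (const tag of tags) filter.push({ term: { tags: tag } });
  return { query: { bool: { filter } } };
}

// Hypothetical values, for illustration only
const body = buildFileQuery({
  category: 'IDS',
  idsType: 'example-ids-type',
  tags: ['reviewed'],
});
console.log(JSON.stringify(body, null, 2));
```

Filter clauses are used here instead of scoring clauses because these fields are matched exactly and relevance ranking adds nothing.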

🚧

IMPORTANT

In TDP v3.5.0, the nested trace object field was removed from Elasticsearch file schemas for files processed by the Tetra File-Log Agent v4.3 and higher. The trace object contained the following fields, which are now part of the file object:

  • osFilePath
  • osFolderPath
  • osCreatedUser

Customers should not rely on the previous trace object attributes when configuring integrations, because those attributes can change. Use the new, immutable file object fields instead.
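
A minimal sketch of reading these fields from the file object of an indexed document follows. The helper name getOsFileInfo and the example document are hypothetical. The sketch assumes the fields are simply absent for files processed by earlier agent versions, and returns null in that case.

```javascript
// Sketch only: getOsFileInfo is a hypothetical helper.
// It reads the OS-level fields from the file object and never touches trace.
function getOsFileInfo(doc) {
  const file = (doc && doc.file) || {};
  return {
    // Assumed to be absent for files processed by agents earlier than v4.3
    osFilePath: file.osFilePath ?? null,
    osFolderPath: file.osFolderPath ?? null,
    osCreatedUser: file.osCreatedUser ?? null,
  };
}

// Hypothetical document fragment, for illustration only
const exampleDoc = {
  category: 'RAW',
  file: {
    osFilePath: '/data/run1/result.csv',
    osFolderPath: '/data/run1',
    osCreatedUser: 'labuser',
  },
};
console.log(getOsFileInfo(exampleDoc));
```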

Elasticsearch Mapping Rules

Because the content of each IDS file varies with its data source, the following dynamic mapping rules are applied to keep indexing flexible and consistent.

For more information about Elasticsearch mapping, see the Elasticsearch documentation on mappings and dynamic templates. The rules are defined as follows:

const mapping = {
  dynamic_templates: [
    {
      // This dynamic mapping rule maps every "string" to "keyword"
      // and creates a subfield of type "text"
      string2keyword: {
        match_mapping_type: 'string',
        mapping: {
          type: 'keyword',
          fields: {
            text: {
              type: 'text',
            },
          },
        },
      },
    },
    {
      // Elasticsearch automatically detects integers and maps them to type "long",
      // and maps non-integer numbers to type "double".
      // This dynamic mapping rule maps all numerical values to "double".
      long2double: {
        match_mapping_type: 'long',
        mapping: {
          type: 'double',
        },
      },
    },
    {
      // For custom fields, try to map the value to double.
      // If the value is a string, ignore it as malformed and index it in the keyword and text subfields.
      customField: {
        path_match: '*.custom_field.*',
        mapping: {
          type: 'double',
          ignore_malformed: true,
          fields: {
            keyword: {
              type: 'keyword'
            },
            text: {
              type: 'text'
            }
          }
        }
      }
    },
  ],
}
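
One practical consequence of the string2keyword rule above is that the field itself supports exact matching, while its text subfield supports analyzed full-text search. The following sketch builds both kinds of query body. The helper names and the field path data.sample.name are hypothetical, and the sketch assumes the field was mapped by the string2keyword rule and not by an explicit mapping.

```javascript
// Sketch only: hypothetical helpers that build Elasticsearch query bodies.

// Exact match against the keyword-typed field (case and whitespace sensitive)
function exactQuery(field, value) {
  return { query: { term: { [field]: value } } };
}

// Analyzed full-text match against the "text" subfield created by string2keyword
function fullTextQuery(field, text) {
  return { query: { match: { [`${field}.text`]: text } } };
}

// Hypothetical IDS field path, for illustration only
const exact = exactQuery('data.sample.name', 'Batch 42');
const fuzzy = fullTextQuery('data.sample.name', 'batch');
console.log(JSON.stringify({ exact, fuzzy }, null, 2));
```

Numeric values mapped by the long2double rule can be queried with a range clause in the same way, without distinguishing integers from non-integers.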