TDP Athena SQL Table Structure Overview

Before you determine which tables to query in the Tetra Data Platform (TDP), TetraScience recommends that you understand the structure of the SQL tables (which contain your file's data). This page describes the structure and naming conventions used for the database and tables.

📘

Normalization of Names

All Athena database, table, and column names are normalized. Because of the character restriction of S3 and Athena, all non-alphanumeric characters are converted to underscores. S3 and Athena still allow some special characters; however for simplicity, TetraScience converts all special characters to underscores.

Primary Database Name

Each organization’s primary database name is the normalized organization slug (organization short name is org slug). The normalization process occurs when the database is automatically created. For example, if the org slug was “pharmacorp-uat”, then the normalized form changes to “pharmacorp_uat“.

📘

Aggregated Databases

As of Tetra Data Platform (TDP) version 3.0, aggregated databases are no longer used.

Table Structure

In the TDP, Athena tables are created based on the [TetraScience IDS] (https://developers.tetrascience.com/docs/introduction-to-tetrascience-intermediate-data-schema) designed for each specific instrument or use case.

Table Guidelines

The automatic creation of Athena tables from a particular IDS follows these general guidelines:

  • Each key whose value is not an array of objects or an array of arrays -> becomes a column in the root table.
  • Keys that contain an array of objects or an array of arrays -> become separate tables and are linked to the parent table using the parent_uuid.
  • Keys that contain an array of literals -> become columns in the table.

You should be aware of these guidelines because they may help you identify the data you are interested in reviewing.

This diagram shows an example of how an IDS file’s data is converted into Athena tables:

📘

Diagram Notes

  • For improved readability, click on the diagram to increase its size.
  • For simplicity, the <IDS name>_<IDS version>_ prefix is omitted for each table name.
1036

IDS to Athena table diagram

Table Name Conventions

The name of the root table is defined in athena.json. Each table name uses this format:
<IDS name>_<IDS version>_<root field name>_<child field name 1>_<child field name 2>_…

📘

NOTE

Private client namespace tables have client_<orgSlug>_ prepended to the beginning of the table name. For example, for a table named ms_masslynx_v2_root in the common namespace, the same table in the private namespace for a company with the org slug, "pharmacorp" would be client_pharmacorp_ms_masslynx_v2_root.

Root tables are important to identify because they typically contain a primary key and foreign keys that are in other tables. Root tables are essential to using joins when you define your queries.

For example, for a table named: ms_masslynx_v2_root, where:

  • IDS Name = ms-masslynx
  • IDS Version = v2
  • Root Field Name = root

In this example, the root table is named root. However in some cases, the root table may be named differently. For example, the root field name for Empower tables is injection because injection is the root field in the IDS. Therefore, the name of the table is: lcuv_empower_v2_injection.

Column Names

You can use the Tetra Data Platform (TDP) to view the columns in each table and the table datatype. For column names, all non-alphanumeric characters are converted to underscores because of the character restriction of S3 and Athena. S3 and Athena still allow some special characters; however for simplicity, TetraScience converts all special characters to underscores.

This example shows column names in a root table:

2530

root table example

This example shows column names in a method_instrument_detector_channels table:

1508

method_instrument_detector_channels table example

📘

Third-Party Tools and Athena

If you are using a third-party tool to query Athena tables, you may need to declare which database to connect to upfront.

  • If you do this, then you can use the table names as they appear in the system.
  • If do do not declare the database upfront, then you must add the database name in your query, for example: demo_uat.lcuv_empower_v2_injection.

Datacube Tables

Depending on the instrument you are using, you may or may not have datacube tables in your SQL table list.

datacubes table creation is handled differently than general Athena table creation. There are only two datacube tables:

  • <IDS name>_<IDS version>_datacubes
  • <IDS name>_<IDS version>_datacubes_data

For example, Empower v2 datacubes tables are:

  • lcuv_empower_v2_datacubes
  • lcuv_empower_v2_datacubes_data

This diagram shows an example of how IDS datacubes are converted into Athena tables:

📘

Diagram Notes

  • For improved readability, click on the diagram to increase its size.
  • For simplicity, the <IDS name>_<IDS version>_ prefix is omitted for each table name.
961

IDS datacubes to Athena table

This example shows a datacubes table:

2312

datacubes table

The datacubes table contains all metadata and:

  • Each measure’s name and unit
  • Each dimension’s name and unit

Each measure’s name and unit, and each dimension’s name and unit, becomes an individual column. How the columns are auto-generated depends on the number of measures and dimensions defined in the IDS file.

This example shows a datacubes_data table:

1038

datacubes_data table

The datacubes_data table stores measure and dimension values. Similarly to name and unit, each measure and dimension value becomes an individual column.

Metadata Tables

TetraScience creates platform metadata from all files in the Data Lake, and stores it in the metadata_v1 table in your SQL table list.

1167

metadata_v1 table

This information displays as columns in the metadata_v1 table:

  • file_id
  • file_path
  • ids_type
  • ids_version

To display the metadata from the Tetra Data Platform (TDP) user interface:

  1. Click Search Files from the left side menu.
  2. Select a JSON file to review.
  3. Click View JSON Details to open the file and view its metadata from the Elasticsearch document dialog.

Partitions

To speed up query time, you can specify a partition in athena.json where S3 stores the data differently. A partition divides the data into groups. And ideally, these groups are the ones you would use in your where clause to filter data and speed up the query process.

Do not select a partition key that is too granular; for example do not use an UUID. An effective partition key should divide the data into groups into similar sizes. For example, an effective partition key for Empower is: "project name".

Due to the character restrictions on the partition value, all non-alphanumeric characters in the value are converted to double underscores (__).
For example:

  • The 'project_name' column contains the original value.
  • Because 'project_name' was used as the partition key, all of its values are normalized in the project column.
1044

Partition

📘

Athena and Partitions

When you use a partition, Athena will add a new virtual column to the table. The column name will be the partition key, and the value will be the values of this key. However, this extra column is not added into the original data.

If you need to add a new partition to the data model, then you must get the latest version.

Use UNION Clause to Avoid Duplicate Results

If you change the version of the IDS that you are using in a pipeline, new version-specific tables are created which include the processing result. If you query the results, it may be possible that you see duplicate information from results generated by the old and new versions of IDS.

To avoid having duplicate results and return a distinct set of values, TetraScience recommends that you use the UNION clause. The UNION clause enables you to combine two separate SELECT statements to produce a result set as a union of both the SELECT statements. To use the UNION clause:

  • The fields used in both SELECT statements must be in same order, have the same number, and have the same data type.
  • The UNION clause produces distinct values in the result set
SELECT column_name(s) FROM lcuv_empower_v8_injection UNION SELECT column_name(s) FROM lcuv_empower_v9_injection;

Result set consists of distinct values.

where:
lcuv_empower_v8_injection = table generated by the first IDS version 8
lcuv_empower_v9_injection = table generated by the second IDS version 9