TDP Athena SQL Table Structure

In the Tetra Data Platform (TDP), multiple Amazon Athena tables are created for each Intermediate Data Schema (IDS) that you're using, based on the sections (partitions) within each IDS. Knowing how these Athena tables are structured can help you find specific data when running SQL queries.

🚧

IMPORTANT

If you're using a third-party tool to query Athena tables, you must declare which database that you want to connect to up front.

To declare your database up front, make sure that you use the table names as they appear in the TDP. If you don't declare the database up front, then you must add the database name in your query in the following format: org_slug.tablename (for example: demo_uat.lcuv_empower_v2_injection).

How IDS Files are Converted to Athena Tables

For each IDS that an organization uses, the following Athena tables are stored in a primary database within the Tetra Data Lake:

IDS-Specific Tables

Information for each IDS is stored in the following Athena tables:

  • root table: Contains top-level values that are specific to an IDS. Each IDS has one root table.
  • Subtables: Contain values that are specific to an IDS. Subtables link back to other subtables and the root table of an IDS.
  • (For some IDSs only) datacubes and datacubes_data tables: Contain each file's metadata and each measure’s name and unit as well as each dimension’s name and unit, respectively. Not all instruments and their associated IDSs create datacubes and datacubes_data tables.

File Record Tables

Records for each file uploaded to the Data Lake along with each file’s associated attributes (for example, labels) are stored in the following Athena tables:

  • file_info_v1 table: Contains each file's platform metadata. This information can also be found by using the API endpoints and in Elasticsearch. For example, these tables include file IDs and paths as well as file versions, sources, and more.

📘

NOTE

A metadata_v1 table that's on a deprecation path also contains some of this same information. For more information, see Athena metadata_v1 Table Deprecation.

  • file_attribute_v1 table: Contains all of the attributes (labels, metadata, and tags) that are associated with each file.

For more information about the values that are recorded in each table, see Athena Table Values.

Primary Database Names

Each organization’s primary database name is the normalized organization slug. The normalization process occurs when the database is automatically created.

For example, if the org slug is pharmacorp-uat, then the normalized primary database name changes to pharmacorp_uat.

For more information about org slugs, see Slugs.

📘

Aggregated Databases

As of TDP version 3.0, aggregated databases are no longer used.

Athena Table Conversion Examples

📘

NOTE

  • For improved readability, select the following diagrams to increase their size.
  • For simplicity, the <IDS name>_<IDS major version>_ prefix is omitted from each of the following example table names.

Normal Table Conversion Example (Without Datacubes)

The following diagram shows an example of how an IDS file’s data is converted into Athena tables (without datacubes):

1036

IDS to Athena table diagram

Datacubes Table Conversion Example

The following diagram shows an example of how IDS datacubes are converted into Athena tables:

961

IDS datacubes to Athena table

Athena Table Structure

Each array of objects defined in an IDS has an associated Athena table. When these tables are automatically generated, the following occurs:

  • Each key whose value isn't an array of objects or arrays becomes a column in the root table.
  • Keys that contain an array of objects or arrays become separate tables and are linked to the parent table by using the parent_uuid.
  • Keys that contain an array of literals become columns in the table.

Athena Table Name Conventions

All Athena database, table, and column names are normalized. Because of the character restrictions required by Amazon Simple Storage Service (Amazon S3) and Amazon Athena, all non-alphanumeric characters are converted to underscores (_). Amazon S3 and Athena still allow some special characters, but TetraScience converts all special characters to underscores for simplicity and consistency.

Each Athena table name uses the following format:
<IDS name>_<IDS major version>_<top-level field name>_<child field name 1>_<child field name 2>_…

📘

NOTE

Private client namespace tables have aclient_<orgSlug>_ value added to the beginning of the table name. For example, a table named ms_masslynx_v2_root in the common namespace would be client_<orgSlug>_ms_masslynx_v2_root in a private namespace.

Root Table Names

The name of the root table is defined in each IDS's athena.jsonschema file.

Root tables are important to identify, because they typically contain a primary key and foreign keys that are in other tables. These keys are essential when defining your SQL queries with JOIN clauses.

Root Table Name Example

For example, a table named: ms_masslynx_v2_root includes the following values:

  • IDS Name = ms-masslynx
  • IDS Version = v2
  • Root Field Name = root

In this example, the root table is named root. In some cases, the root table may be named differently. For example, the root field name for an Empower table is injection, because injection is the root field in the IDS. In this case, the name of the table would be lcuv_empower_v2_injection.

Athena Table Values

The following information is available in each Athena table that's created for an IDS.

root Table Values

The values listed in each IDS's root table are unique, and are based on each organization's specific use case.

root Table Column Name Examples

The following example shows column names in a root table:

2530

Root table example

file_info_v1 Table Values

📘

NOTE

To display the file information that's in the file_info_v1 table in the TDP user interface, see View JSON File Details.

The following information displays as columns in file_info_v1 tables:

Column NameTypeDescription
file_idStringShows the file's unique ID, which can be joined with the file_id field in the file_attribute_v1 table.
file_pathStringThis value is determined by the TDP component that uploaded the file. For more information, see the documentation for the component that you're using.
org_slug StringShows the TDP organization slug
trace_idStringShows a unique ID that links files in the Data Lake that are derived from the same source file.
file_keyStringShows the location of the file in the Amazon S3 bucket (S3 Object Key) in the following format: {orgSlug}/{sourceId}/{category}/{filePath\*}

Note: The filePath\* variable is determined by the TDP component that uploaded the file. For more information, see the documentation for the component that you’re using.
bucketStringShows the Amazon S3 bucket that the file is stored in
versionStringShows the file version identifier
sizeBigIntShows the size of the file in Amazon S3 in bytes
checksumStringShows the Amazon S3 checksum value that's used to verify the integrity of data that you uploaded or downloaded

Note: For more information, see Checking object integrity in the AWS Documentation.
file_s3_content_encodingStringShows the form of the Amazon S3 metadata content encoding (for example, gzip or UTF-8)
file_s3_content_typeStringShows the content type that's recorded in the file's Amazon S3 metadata
encoded_pathStringShows the sanitized version of the file path (URL encoded) to avoid errors and to provide standardize naming in Athena CSV files

Note: Special characters in the Athena CSV file name can result in AWS Athena query failure.
raw_md5_checksumStringShows the MD5 checksum value calculated at ingestion time, before any potential compression for file integrity comparison
created_atTimestampShows the date and time when the file was uploaded to the Data Lake
categoryStringShows the TDP file category: RAW, IDS, or PROCESSED
source_typeStringShows the type of source that generated the file
source_nameStringShows the name of the source that generated the file
source_idStringShow's the source's ID string
input_pipeline_file_idStringShows the file ID of the RAW file that triggered the pipeline
idsStringShows the IDS name
ids_slugStringShows the IDS slug
ids_versionStringShows the IDS version
ids_namespaceStringShows the IDS namespace
traceStringShows the serialized JSON string that contains all of the file's non-platform metadata
os_created_timeTimestamp(For Tetra File-Log Agent files only) Shows the date and time when the file was created by the data source (not when the file is uploaded to the TDP)
os_last_modified_timeTimestamp(For Tetra File-Log Agent files only) Shows the date and time when the file was last modified from the data source
os_size_on_diskBigInt(For Tetra File-Log Agent files only) Shows the size of the file in the original source system in bytes
partition_pathStringShows the partition key

Note: For more information, see Partitions.
record_created_atTimestampShows the date and time that the Athena CSV file was created and added to the file_info_v1 table

📘

NOTE

The following information also displays as columns in the metadata_v1 table, which is on a deprecation path:

  • file_id
  • file_path
  • ids_type
  • ids_version

file_attribute_v1 Table Values

The following information displays as columns in file_attribute_v1 tables:

Column NameTypeDescription
file_idStringShows the file's ID number from the Amazon S3 bucket, which can be joined with the file_id field in the file_info_v1 table
typeStringShows the attribute type: Label, Metadata, or Tag
keyString(Not included for tags) Shows the key in an attribute's key:value pair
valueStringShows the value in an attribute's key:value pair (or the tag value for tags)

datacubes and datacubes_data Table Values

📘

NOTE

Not all IDSs generate datacubesand datacubes_data tables.

Each IDS that produces datacubes tables creates two different datacubes tables:

  • <IDS name>_<IDS major version>_datacubes
  • <IDS name>_<IDS major version>_datacubes_data

datacubes Table Values

The datacubes table contains all of a file's metadata and the following:

  • Each measure’s name and unit
  • Each dimension’s name and unit

Each measure’s name and unit along with each dimension’s name and unit becomes an individual column. How the columns are generated depends on the number of measures and dimensions defined in the IDS file.

The following example shows a datacubes table:

2312

datacubes table

datacubes_data Table Values

The datacubes_data table stores measure and dimension values. Similar to name and unit, each measure and dimension value becomes an individual column.

The following example shows a datacubes_data table:

1038

datacubes_data table

Best Practices for Working with Athena Tables in the TDP

When running SQL queries in the TDP, it's a best practice to use partitions to speed up query runtimes. Also, make sure that you use UNION clauses to avoid returning duplicate results.

Use Partitions

To speed up query time, you can specify a partition in athena.json, which directs Amazon S3 to store different types of IDS data into groups. Ideally, these groups are the ones that you would use in your where clause to filter data and speed up the query process.

When creating partitions, keep in mind the following:

  • Don't select a partition key that is too granular (for example, do not use an UUID). An effective partition key should divide the data into groups into similar sizes. For example, an effective partition key for Empower is project_name.
  • Because of the character restrictions on the partition value, all non-alphanumeric characters in the value are converted to double underscores (__). For example, if one of the partition values is Demo\Demo_Clone_2019, which contains the non-alphanumeric forward slash, this value will be converted to Demo__Demo_Clone_2019.

📘

NOTE

When you use a partition, Athena will add a new virtual column to the table. The column name will be the partition key, and the value will be the values of this key. However, this extra column is not added into the original data. If you need to add a new partition to the data model, then you must get the latest version.

Use UNION Clauses in Your SQL Queries

If you change the version of the IDS that you are using in a pipeline, new version-specific tables are created, which include the processing result. If you query the results, it may be possible that you see duplicate information from results generated by the old and new versions of IDS.

To avoid having duplicate results, it's recommended that you use the UNION clause. The UNION clause allows you to combine two separate SELECT statements to produce a result set as a union of both the SELECT statements.

When using UNION clauses in your SQL statements, keep in mind the following:

  • To use the UNION clause, the fields used in both SELECT statements must be in the same order, have the same number, and have the same data type.
  • When using the UNION clause, keep in mind that the UNION clause produces distinct values in the result set.

Example SQL Statement that Uses a UNION Clause

SELECT column_name(s) FROM lcuv_empower_v8_injection UNION SELECT column_name(s) FROM lcuv_empower_v9_injection;

Result set consists of distinct values.

📘

NOTE

In the result set produced by the previous example statement, lcuv_empower_v8_injection is the table generated from version 8.X.Y IDSs. lcuv_empower_v9_injection is the table generated from version 9.X.Y IDSs.