TDP Athena SQL Table Structure

Suggest Edits

In the Tetra Data Platform (TDP), multiple Amazon Athena tables are created for each Intermediate Data Schema (IDS) that you're using, based on the sections (partitions) within each IDS. Knowing how these Athena tables are structured can help you find specific data when running SQL queries.

🚧
IMPORTANT
If you're using a third-party tool to query Athena tables, you must declare which database that you want to connect to up front.
To declare your database up front, make sure that you use the table names as they appear in the TDP. If you don't declare the database up front, then you must add the database name in your query in the following format: org_slug.tablename (for example: demo_uat.lcuv_empower_v2_injection).

How IDS Files are Converted to Athena Tables

For each IDS that an organization uses, the following Athena tables are stored in a primary database within the Tetra Data Lake:

IDS-Specific Tables

Information for each IDS is stored in the following Athena tables:

root table: Contains top-level values that are specific to an IDS. Each IDS has one root table.
Subtables: Contain values that are specific to an IDS. Subtables link back to other subtables and the root table of an IDS.
(For some IDSs only) datacubes and datacubes_data tables: Contain each file's metadata and each measure’s name and unit as well as each dimension’s name and unit, respectively. Not all instruments and their associated IDSs create datacubes and datacubes_data tables.

File Record Tables

Records for each file uploaded to the Data Lake along with each file’s associated attributes (for example, labels) are stored in the following Athena tables:

file_info_v1 table: Contains each file's platform metadata. This information can also be found by using the API endpoints and in Elasticsearch. For example, these tables include file IDs and paths as well as file versions, sources, and more.

📘
NOTE
A metadata_v1 table that's on a deprecation path also contains some of this same information. For more information, see Athena metadata_v1 Table Deprecation.

file_attribute_v1 table: Contains all of the attributes (labels, metadata, and tags) that are associated with each file.

For more information about the values that are recorded in each table, see Athena Table Values.

Primary Database Names

Each organization’s primary database name is the normalized organization slug. The normalization process occurs when the database is automatically created.

For example, if the org slug is pharmacorp-uat, then the normalized primary database name changes to pharmacorp_uat.

For more information about org slugs, see What Is a Slug?

📘
Aggregated Databases
As of TDP version 3.0, aggregated databases are no longer used.

Athena Table Conversion Examples

📘
NOTE

For improved readability, select the following diagrams to increase their size.

For simplicity, the <IDS name>_<IDS major version>_ prefix is omitted from each of the following example table names.

Normal Table Conversion Example (Without Datacubes)

The following diagram shows an example of how an IDS file’s data is converted into Athena tables (without datacubes):

Datacubes Table Conversion Example

The following diagram shows an example of how IDS datacubes are converted into Athena tables:

Athena Table Structure

Each array of objects defined in an IDS has an associated Athena table. When these tables are automatically generated, the following occurs:

Each key whose value isn't an array of objects or arrays becomes a column in the root table.
Keys that contain an array of objects or arrays become separate tables and are linked to the parent table by using the parent_uuid.
Keys that contain an array of literals become columns in the table.

Athena Table Name Conventions

All Athena database, table, and column names are normalized. Because of the character restrictions required by Amazon Simple Storage Service (Amazon S3) and Amazon Athena, all non-alphanumeric characters are converted to underscores (_). Amazon S3 and Athena still allow some special characters, but TetraScience converts all special characters to underscores for simplicity and consistency.

Each Athena table name uses the following format:
<IDS name>_<IDS major version>_<top-level field name>_<child field name 1>_<child field name 2>_…

📘
NOTE
Private client namespace tables have aclient_<orgSlug>_ value added to the beginning of the table name. For example, a table named ms_masslynx_v2_root in the common namespace would be client_<orgSlug>_ms_masslynx_v2_root in a private namespace.

Root Table Names

The name of the root table is defined in each IDS's athena.jsonschema file.

Root tables are important to identify, because they typically contain a primary key and foreign keys that are in other tables. These keys are essential when defining your SQL queries with JOIN clauses.

Root Table Name Example

For example, a table named: ms_masslynx_v2_root includes the following values:

IDS Name = ms-masslynx
IDS Version = v2
Root Field Name = root

In this example, the root table is named root. In some cases, the root table may be named differently. For example, the root field name for an Empower table is injection, because injection is the root field in the IDS. In this case, the name of the table would be lcuv_empower_v2_injection.

Athena Table Values

The following information is available in each Athena table that's created for an IDS.

root Table Values

The values listed in each IDS's root table are unique, and are based on each organization's specific use case.

root Table Column Name Examples

The following example shows column names in a root table:

file_info_v1 Table Values

📘
NOTE
To display the file information that's in the file_info_v1 table in the TDP user interface, see View JSON File Details.

The following information displays as columns in file_info_v1 tables:

Column Name	Type	Description
`file_id`	String	Shows the file's unique ID, which can be joined with the `file_id` field in the `file_attribute_v1` table.
`file_path`	String	This value is determined by the TDP component that uploaded the file. For more information, see the documentation for the component that you're using.
`org_slug`	String	Shows the TDP organization slug
`trace_id`	String	Shows a unique ID that links files in the Data Lake that are derived from the same source file.
`file_key`	String	Shows the location of the file in the Amazon S3 bucket (S3 Object Key) in the following format: `{orgSlug}/{sourceId}/{category}/{filePath\}` Note:* The `filePath\*` variable is determined by the TDP component that uploaded the file. For more information, see the documentation for the component that you’re using.
`bucket`	String	Shows the Amazon S3 bucket that the file is stored in
`version`	String	Shows the file version identifier
`size`	BigInt	Shows the size of the file in Amazon S3 in bytes
`checksum`	String	Shows the Amazon S3 checksum value that's used to verify the integrity of data that you uploaded or downloaded Note: For more information, see Checking object integrity in the AWS Documentation.
`file_s3_content_encoding`	String	Shows the form of the Amazon S3 metadata content encoding (for example, gzip or UTF-8)
`file_s3_content_type`	String	Shows the content type that's recorded in the file's Amazon S3 metadata
`encoded_path`	String	Shows the sanitized version of the file path (URL encoded) to avoid errors and to provide standardize naming in Athena CSV files Note: Special characters in the Athena CSV file name can result in AWS Athena query failure.
`raw_md5_checksum`	String	Shows the MD5 checksum value calculated at ingestion time, before any potential compression for file integrity comparison
`created_at`	Timestamp	Shows the date and time when the file was uploaded to the Data Lake
`category`	String	Shows the TDP file category: RAW, IDS, or PROCESSED
`source_type`	String	Shows the type of source that generated the file
`source_name`	String	Shows the name of the source that generated the file
`source_id`	String	Show's the source's ID string
`input_pipeline_file_id`	String	Shows the file ID of the RAW file that triggered the pipeline
`ids`	String	Shows the IDS name
`ids_slug`	String	Shows the IDS slug
`ids_version`	String	Shows the IDS version
`ids_namespace`	String	Shows the IDS namespace
`trace`	String	Shows the serialized JSON string that contains all of the file's non-platform metadata
`os_created_time`	Timestamp	(For Tetra File-Log Agent files only) Shows the date and time when the file was created by the data source (not when the file is uploaded to the TDP)
`os_last_modified_time`	Timestamp	(For Tetra File-Log Agent files only) Shows the date and time when the file was last modified from the data source
`os_size_on_disk`	BigInt	(For Tetra File-Log Agent files only) Shows the size of the file in the original source system in bytes
`partition_path`	String	Shows the partition key Note: For more information, see Partitions.
`record_created_at`	Timestamp	Shows the date and time that the Athena CSV file was created and added to the `file_info_v1` table

📘
NOTE
The following information also displays as columns in the metadata_v1 table, which is on a deprecation path:

file_id

file_path

ids_type

ids_version

file_attribute_v1 Table Values

The following information displays as columns in file_attribute_v1 tables:

Column Name	Type	Description
`file_id`	String	Shows the file's ID number from the Amazon S3 bucket, which can be joined with the `file_id` field in the `file_info_v1` table
`type`	String	Shows the attribute type: Label, Metadata, or Tag
`key`	String	(Not included for tags) Shows the key in an attribute's `key:value` pair
`value`	String	Shows the value in an attribute's `key:value` pair (or the tag value for tags)

datacubes and datacubes_data Table Values

📘
NOTE
Not all IDSs generate datacubesand datacubes_data tables.

Each IDS that produces datacubes tables creates two different datacubes tables:

<IDS name>_<IDS major version>_datacubes
<IDS name>_<IDS major version>_datacubes_data

datacubes Table Values

The datacubes table contains all of a file's metadata and the following:

Each measure’s name and unit
Each dimension’s name and unit

Each measure’s name and unit along with each dimension’s name and unit becomes an individual column. How the columns are generated depends on the number of measures and dimensions defined in the IDS file.

The following example shows a datacubes table:

datacubes_data Table Values

The datacubes_data table stores measure and dimension values. Similar to name and unit, each measure and dimension value becomes an individual column.

The following example shows a datacubes_data table:

Best Practices for Working with Athena Tables in the TDP

When running SQL queries in the TDP, it's a best practice to use partitions to speed up query runtimes. Also, make sure that you use UNION clauses to avoid returning duplicate results.

Use Partitions

To speed up query time, you can specify a partition in athena.json, which directs Amazon S3 to store different types of IDS data into groups. Ideally, these groups are the ones that you would use in your where clause to filter data and speed up the query process.

When creating partitions, keep in mind the following:

Don't select a partition key that is too granular (for example, do not use an UUID). An effective partition key should divide the data into groups into similar sizes. For example, an effective partition key for Empower is project_name.
Because of the character restrictions on the partition value, all non-alphanumeric characters in the value are converted to double underscores (__). For example, if one of the partition values is Demo\Demo_Clone_2019, which contains the non-alphanumeric forward slash, this value will be converted to Demo__Demo_Clone_2019.

📘
NOTE
When you use a partition, Athena will add a new virtual column to the table. The column name will be the partition key, and the value will be the values of this key. However, this extra column is not added into the original data. If you need to add a new partition to the data model, then you must get the latest version.

Use `UNION` Clauses in Your SQL Queries

If you change the version of the IDS that you are using in a pipeline, new version-specific tables are created, which include the processing result. If you query the results, it may be possible that you see duplicate information from results generated by the old and new versions of IDS.

To avoid having duplicate results, it's recommended that you use the UNION clause. The UNION clause allows you to combine two separate SELECT statements to produce a result set as a union of both the SELECT statements.

When using UNION clauses in your SQL statements, keep in mind the following:

To use the UNION clause, the fields used in both SELECT statements must be in the same order, have the same number, and have the same data type.
When using the UNION clause, keep in mind that the UNION clause produces distinct values in the result set.

Example SQL Statement that Uses a UNION Clause

SELECT column_name(s) FROM lcuv_empower_v8_injection UNION SELECT column_name(s) FROM lcuv_empower_v9_injection;

Result set consists of distinct values.

📘
NOTE
In the result set produced by the previous example statement, lcuv_empower_v8_injection is the table generated from version 8.X.Y IDSs. lcuv_empower_v9_injection is the table generated from version 9.X.Y IDSs.

Updated 9 months ago

TDP Athena SQL Table Structure

🚧
IMPORTANT

How IDS Files are Converted to Athena Tables

IDS-Specific Tables

File Record Tables

📘
NOTE

Primary Database Names

📘
Aggregated Databases

Athena Table Conversion Examples

📘
NOTE

Athena Table Structure

Athena Table Name Conventions

📘
NOTE

Root Table Names

Athena Table Values

root Table Values

file_info_v1 Table Values

📘
NOTE

📘
NOTE

file_attribute_v1 Table Values

datacubes and datacubes_data Table Values

📘
NOTE

Best Practices for Working with Athena Tables in the TDP

Use Partitions

📘
NOTE

Use `UNION` Clauses in Your SQL Queries

📘
NOTE

🚧IMPORTANT

How IDS Files are Converted to Athena Tables

IDS-Specific Tables

File Record Tables

📘NOTE

Primary Database Names

📘Aggregated Databases

Athena Table Conversion Examples

📘NOTE

Athena Table Structure

Athena Table Name Conventions

📘NOTE

Root Table Names

Athena Table Values

root Table Values

file_info_v1 Table Values

📘NOTE

📘NOTE

file_attribute_v1 Table Values

datacubes and datacubes_data Table Values

📘NOTE

Best Practices for Working with Athena Tables in the TDP

Use Partitions

📘NOTE

Use UNION Clauses in Your SQL Queries

📘NOTE

🚧
IMPORTANT

📘
NOTE

📘
Aggregated Databases

📘
NOTE

📘
NOTE

📘
NOTE

📘
NOTE

📘
NOTE

📘
NOTE

Use `UNION` Clauses in Your SQL Queries

📘
NOTE