TDP Athena SQL Table Structure
In the Tetra Data Platform (TDP), multiple Amazon Athena tables are created for each Intermediate Data Schema (IDS) that you're using, based on the sections (partitions) within each IDS. Knowing how these Athena tables are structured can help you find specific data when running SQL queries.
Using Third-Party Tools and Athena
If you're using a third-party tool to query Athena tables, you might need to declare which database that you want to connect to up front.
To declare your database up front, make sure that you use the table names as they appear in the TDP. If you don't declare the database upfront, then you must add the database name in your query (for example:
demo_uat.lcuv_empower_v2_injection
).
How IDS Files are Converted to Athena Tables
For each IDS that an organization uses, the following Athena tables are stored in a primary database within the Tetra Data Lake:
IDS-Specific Tables
Information for each IDS is stored in the following Athena tables:
root
table: Contains top-level values that are specific to an IDS. Each IDS has one root table.- Subtables: Contain values that are specific to an IDS. Subtables link back to other subtables and the
root
table of an IDS. - (For some IDSs only)
datacubes
anddatacubes_data
tables: Contain each file's metadata and each measure’s name and unit as well as each dimension’s name and unit, respectively. Not all instruments and their associated IDSs createdatacubes
anddatacubes_data
tables.
File Record Tables
Records for each file uploaded to the Data Lake along with each file’s associated attributes (for example, labels) are stored in the following Athena tables:
file_info_v1
table: Contains each file's platform metadata. This information can also be found by using the API endpoints and in Elasticsearch. For example, these tables include file IDs and paths as well as file versions, sources, and more.
NOTE
A
metadata_v1
table that's on a deprecation path also contains some of this same information. For more information, see Athena metadata_v1 Table Deprecation.
file_attribute_v1
table: Contains all of the attributes (labels, metadata, and tags) that are associated with each file.
For more information about the values that are recorded in each table, see Athena Table Values.
Primary Database Names
Each organization’s primary database name is the normalized organization slug. The normalization process occurs when the database is automatically created.
For example, if the org slug is pharmacorp-uat, then the normalized primary database name changes to pharmacorp_uat.
For more information about org slugs, see What Is a Slug?
Aggregated Databases
As of TDP version 3.0, aggregated databases are no longer used.
Athena Table Conversion Examples
NOTE
- For improved readability, select the following diagrams to increase their size.
- For simplicity, the
<IDS name>_<IDS major version>_
prefix is omitted from each of the following example table names.
Normal Table Conversion Example (Without Datacubes)
The following diagram shows an example of how an IDS file’s data is converted into Athena tables (without datacubes):
Datacubes Table Conversion Example
The following diagram shows an example of how IDS datacubes are converted into Athena tables:
Athena Table Structure
Each array of objects defined in an IDS has an associated Athena table. When these tables are automatically generated, the following occurs:
- Each key whose value isn't an array of objects or arrays becomes a column in the
root
table. - Keys that contain an array of objects or arrays become separate tables and are linked to the parent table by using the
parent_uuid
. - Keys that contain an array of literals become columns in the table.
Athena Table Name Conventions
All Athena database, table, and column names are normalized. Because of the character restrictions required by Amazon Simple Storage Service (Amazon S3) and Amazon Athena, all non-alphanumeric characters are converted to underscores (_). Amazon S3 and Athena still allow some special characters, but TetraScience converts all special characters to underscores for simplicity and consistency.
Each Athena table name uses the following format:
<IDS name>_<IDS major version>_<top-level field name>_<child field name 1>_<child field name 2>_…
NOTE
Private client namespace tables have a
client_<orgSlug>_
value added to the beginning of the table name. For example, a table namedms_masslynx_v2_root
in the common namespace would beclient_<orgSlug>_ms_masslynx_v2_root
in a private namespace.
Root Table Names
The name of the root
table is defined in each IDS's athena.json
schema file.
Root tables are important to identify, because they typically contain a primary key and foreign keys that are in other tables. These keys are essential when defining your SQL queries with JOIN
clauses.
Root Table Name Example
For example, a table named: ms_masslynx_v2_root
includes the following values:
- IDS Name =
ms-masslynx
- IDS Version =
v2
- Root Field Name =
root
In this example, the root table is named root
. In some cases, the root table may be named differently. For example, the root field name for an Empower table is injection
, because injection
is the root field in the IDS. In this case, the name of the table would be lcuv_empower_v2_injection
.
Athena Table Values
The following information is available in each Athena table that's created for an IDS.
root Table Values
The values listed in each IDS's root table are unique, and are based on each organization's specific use case.
root Table Column Name Examples
The following example shows column names in a root
table:
file_info_v1 Table Values
NOTE
To display the file information that's in the
file_info_v1
table in the TDP user interface, see View JSON File Details.
The following information displays as columns in file_info_v1
tables:
Column Name | Type | Description |
---|---|---|
file_id | String | Shows the file's unique ID, which can be joined with the file_id field in the file_attribute_v1 table. |
file_path | String | This value is determined by the TDP component that uploaded the file. For more information, see the documentation for the component that you're using. |
org_slug | String | Shows the TDP organization slug |
trace_id | String | Shows a unique ID that links files in the Data Lake that are derived from the same source file. |
file_key | String | Shows the location of the file in the Amazon S3 bucket (S3 Object Key) in the following format: {orgSlug}/{sourceId}/{category}/{filePath\*} Note: The filePath\* variable is determined by the TDP component that uploaded the file. For more information, see the documentation for the component that you’re using. |
bucket | String | Shows the Amazon S3 bucket that the file is stored in |
version | String | Shows the file version identifier |
size | BigInt | Shows the size of the file in Amazon S3 in bytes |
checksum | String | Shows the Amazon S3 checksum value that's used to verify the integrity of data that you uploaded or downloaded Note: For more information, see Checking object integrity in the AWS Documentation. |
file_s3_content_encoding | String | Shows the form of the Amazon S3 metadata content encoding (for example, gzip or UTF-8) |
file_s3_content_type | String | Shows the content type that's recorded in the file's Amazon S3 metadata |
encoded_path | String | Shows the sanitized version of the file path (URL encoded) to avoid errors and to provide standardize naming in Athena CSV files Note: Special characters in the Athena CSV file name can result in AWS Athena query failure. |
raw_md5_checksum | String | Shows the MD5 checksum value calculated at ingestion time, before any potential compression for file integrity comparison |
created_at | Timestamp | Shows the date and time when the file was uploaded to the Data Lake |
category | String | Shows the TDP file category: RAW, IDS, or PROCESSED |
source_type | String | Shows the type of source that generated the file |
source_name | String | Shows the name of the source that generated the file |
source_id | String | Show's the source's ID string |
input_pipeline_file_id | String | Shows the file ID of the RAW file that triggered the pipeline |
ids | String | Shows the IDS name |
ids_slug | String | Shows the IDS slug |
ids_version | String | Shows the IDS version |
ids_namespace | String | Shows the IDS namespace |
trace | String | Shows the serialized JSON string that contains all of the file's non-platform metadata |
os_created_time | Timestamp | (For Tetra File-Log Agent files only) Shows the date and time when the file was created by the data source (not when the file is uploaded to the TDP) |
os_last_modified_time | Timestamp | (For Tetra File-Log Agent files only) Shows the date and time when the file was last modified from the data source |
os_size_on_disk | BigInt | (For Tetra File-Log Agent files only) Shows the size of the file in the original source system in bytes |
partition_path | String | Shows the partition key Note: For more information, see Partitions. |
record_created_at | Timestamp | Shows the date and time that the Athena CSV file was created and added to the file_info_v1 table |
NOTE
The following information also displays as columns in the
metadata_v1
table, which is on a deprecation path:
file_id
file_path
ids_type
ids_version
file_attribute_v1 Table Values
The following information displays as columns in file_attribute_v1
tables:
Column Name | Type | Description |
---|---|---|
file_id | String | Shows the file's ID number from the Amazon S3 bucket, which can be joined with the file_id field in the file_info_v1 table |
type | String | Shows the attribute type: Label, Metadata, or Tag |
key | String | (Not included for tags) Shows the key in an attribute's key:value pair |
value | String | Shows the value in an attribute's key:value pair (or the tag value for tags) |
datacubes and datacubes_data Table Values
NOTE
Not all IDSs generate
datacubes
anddatacubes_data
tables.
Each IDS that produces datacubes tables creates two different datacubes tables:
<IDS name>_<IDS major version>_datacubes
<IDS name>_<IDS major version>_datacubes_data
datacubes Table Values
The datacubes
table contains all of a file's metadata and the following:
- Each measure’s name and unit
- Each dimension’s name and unit
Each measure’s name and unit along with each dimension’s name and unit becomes an individual column. How the columns are generated depends on the number of measures and dimensions defined in the IDS file.
The following example shows a datacubes
table:
datacubes_data Table Values
The datacubes_data
table stores measure and dimension values. Similar to name and unit, each measure and dimension value becomes an individual column.
The following example shows a datacubes_data
table:
Best Practices for Working with Athena Tables in the TDP
When running SQL queries in the TDP, it's a best practice to use partitions to speed up query runtimes. Also, make sure that you use UNION
clauses to avoid returning duplicate results.
Use Partitions
To speed up query time, you can specify a partition in athena.json
, which directs Amazon S3 to store different types of IDS data into groups. Ideally, these groups are the ones that you would use in your where
clause to filter data and speed up the query process.
When creating partitions, keep in mind the following:
- Don't select a partition key that is too granular (for example, do not use an UUID). An effective partition key should divide the data into groups into similar sizes. For example, an effective partition key for Empower is
project_name
. - Because of the character restrictions on the partition value, all non-alphanumeric characters in the value are converted to double underscores (
__
). For example, if one of the partition values isDemo\Demo_Clone_2019
, which contains the non-alphanumeric forward slash, this value will be converted toDemo__Demo_Clon_2019
.
NOTE
When you use a partition, Athena will add a new virtual column to the table. The column name will be the partition key, and the value will be the values of this key. However, this extra column is not added into the original data. If you need to add a new partition to the data model, then you must get the latest version.
Use UNION
Clauses in Your SQL Queries
UNION
Clauses in Your SQL QueriesIf you change the version of the IDS that you are using in a pipeline, new version-specific tables are created, which include the processing result. If you query the results, it may be possible that you see duplicate information from results generated by the old and new versions of IDS.
To avoid having duplicate results, it's recommended that you use the UNION
clause. The UNION
clause allows you to combine two separate SELECT
statements to produce a result set as a union of both the SELECT
statements.
When using UNION
clauses in your SQL statements, keep in mind the following:
- To use the
UNION
clause, the fields used in bothSELECT
statements must be in the same order, have the same number, and have the same data type. - When using the
UNION
clause, keep in mind that theUNION
clause produces distinct values in the result set.
Example SQL Statement that Uses a UNION Clause
SELECT column_name(s) FROM lcuv_empower_v8_injection UNION SELECT column_name(s) FROM lcuv_empower_v9_injection;
Result set consists of distinct values.
NOTE
In the result set produced by the previous example statement,
lcuv_empower_v8_injection
is the table generated from version 8.X.Y IDSs.lcuv_empower_v9_injection
is the table generated from version 9.X.Y IDSs.
Updated about 1 year ago