Query Lakehouse Tables
This guide explains how to effectively query nested data structures within Lakehouse tables.
For information about the different ways you can access your Lakehouse tables and apply these best practices, see Lakehouse Data Access.
Best Practices For Querying Lakehouse Tables
When querying Lakehouse tables, it's recommended that you do the following:
- Use aliases to improve readability when working with deeply nested structures.
- Use UNNEST for array processing when you need to work with individual elements (see the example sketch after this list).
For more best practices on querying hierarchical data, see Use CROSS JOIN and UNNEST in the Amazon Athena User Guide.
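As a minimal sketch combining both practices, the following query aliases each unnested element so that deeply nested fields stay short and readable. It assumes the lcuv_empower_v17 table and the results and peaks columns used in the examples later in this guide; substitute the names from your own tables.

```sql
SELECT
  file_metadata.file_id,
  peak.analyte AS analyte,         -- alias keeps the deeply nested path short
  peak.area.value AS area_value
FROM lcuv_empower_v17
  CROSS JOIN UNNEST(results) AS t1(result_element)     -- one row per result
  CROSS JOIN UNNEST(result_element.peaks) AS t2(peak)  -- one row per peak
LIMIT 10;
```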
Working With Nested Data
Lakehouse tables contain columns that may represent complex nested structures, including the following:
- Objects: collections of key-value pairs
- Arrays: ordered lists of values
- Combinations of objects and arrays at multiple levels of nesting
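If your query engine supports it (Athena and Trino do), you can inspect how these structures appear in a given table before writing queries by running DESCRIBE against it. The output shown in the comments below is illustrative only; the actual nesting depends on your IDS.

```sql
-- Inspect the table schema to see which columns are structs (objects) and arrays.
DESCRIBE lcuv_empower_v17;

-- Illustrative output (your columns and types will differ):
-- file_metadata   struct<file_id:string, ids_path:string>
-- runs            array<struct<injection:struct<id:string>>>
-- results         array<struct<peaks:array<struct<analyte:string>>>>
```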
Basic Dot Notation Access
Accessing Object Fields
To access fields within object columns, use dot notation. To navigate through multiple levels of nested objects, chain the dot notation, as shown for the injection_id column in the following example.
Dot Notation Query Example
```sql
SELECT
  file_metadata.file_id,
  runs[1].injection.id AS injection_id,
  systems[1].name AS system_name,
  project.name AS project
FROM
  lcuv_empower_v17
LIMIT 10;
```
Results
| file_id | injection_id | system_name | project |
|---|---|---|---|
| ac7c32e3-0593 | 1937 | System A | Site_Working_Projects_X |
| b6c244b1-8799 | 1931 | System B | Site_Working_Projects_Y |
| 332f4d40-72af | 2033 | System A | Site_Working_Projects_Z |
Querying Array Elements
Array Unnesting with UNNEST
To transform array elements into rows, use the UNNEST function. UNNEST expands nested values into individual rows, which makes it easier to see what your end-state data will look like for your analytics use case.
UNNEST Function Example Query
The following query demonstrates using the UNNEST function to analyze peak data for the Empower IDS.
```sql
SELECT
  file_metadata.file_id,
  rp_element.analyte AS analyte,
  rp_element.area.value AS area_value,
  rp_element.area.percent.value AS area_percent,
  rpw_element.value AS peak_width
FROM lcuv_empower_v17
  CROSS JOIN UNNEST(results) AS t1(result_element)
  CROSS JOIN UNNEST(result_element.peaks) AS t2(rp_element)
  CROSS JOIN UNNEST(rp_element.widths) AS t3(rpw_element);
```
Results
| file_id | analyte | area_value | area_percent | peak_width |
|---|---|---|---|---|
| 6cfec553 | Lithium | 26527.0 | 16.17 | 0.3272604 |
| 6cfec553 | Sodium | 37879.0 | 23.09 | 0.59 |
| 6cfec553 | Sodium | 37879.0 | 23.09 | 0.313613 |
Working with Normalized Lakehouse Tables
When configuring ids-to-lakehouse pipelines, you can choose to create normalized Lakehouse tables (Normalized IDS and Normalized Datacubes tables). These normalized tables transform Lakehouse tables from the new, nested Delta Table structure to a normalized structure that mirrors the legacy Amazon Athena table structure in the Data Lakehouse. They reduce the need to rewrite existing queries that use the legacy table structure.
The normalized IDS Lakehouse tables are structured exactly the same as the legacy Amazon Athena SQL Tables.
The normalized datacubes Lakehouse tables differ slightly from the legacy Amazon Athena datacubes tables. For more information, see Datacubes Lakehouse Tables Best Practices.
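As a minimal sketch, assuming a hypothetical normalized table named lcuv_empower_v17_normalized with illustrative flat column names (your actual table and column names depend on your IDS and pipeline configuration), a legacy-style query selects columns directly instead of using dot notation:

```sql
-- Hypothetical normalized table and column names; substitute the names created by your pipeline.
SELECT
  file_id,
  system_name,
  project_name
FROM lcuv_empower_v17_normalized
LIMIT 10;
```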
Datacubes Lakehouse Tables Best Practices
The updated datacubes Lakehouse table schema streamlines data access, improves query performance, and provides a more robust data model for managing data at scale.
The datacubes Lakehouse table schema removes the more complex file_metadata struct column that's part of the original <ids_name>_datacubes tables from the GA version of the Lakehouse in favor of four new, top-level columns:
- file_metadata_file_id (STRING): A unique identifier for each file.
- file_metadata_ids_path (STRING): The path of the data source.
- file_metadata_created_at_timestamp (BIGINT): The creation timestamp of the file.
- file_metadata_created_at (TIMESTAMP): The creation time of the file.
This change is designed to improve data access efficiency by reducing the need for costly joins and nested field lookups.
The original file_metadata column will be retained for backward compatibility, but will no longer be populated with data for new rows. Instead, it will have a NULL value for all newly inserted records. For tables created with this new release, the file_metadata column will always be NULL.
For best performance, customers should update their queries to use one or more of the new top-level columns in their WHERE or ON clauses. For example, instead of querying on t.file_metadata.file_id, use t.file_metadata_file_id.
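As a sketch of that change, the following shows a filter rewritten to use the new top-level column. The table name <ids_name>_datacubes is a placeholder as in the schema description above, and the file ID literal is illustrative.

```sql
-- Before: filtering on the nested struct (NULL for newly inserted rows)
-- SELECT * FROM <ids_name>_datacubes t WHERE t.file_metadata.file_id = 'ac7c32e3-0593';

-- After: filtering on the new top-level column
SELECT *
FROM <ids_name>_datacubes t
WHERE t.file_metadata_file_id = 'ac7c32e3-0593';  -- illustrative file ID value
```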
Documentation Feedback
Do you have questions about our documentation or suggestions for how we can improve it? Start a discussion in TetraConnect Hub. For access, see Access the TetraConnect Hub.
NOTE: Feedback isn't part of the official TetraScience product documentation. TetraScience doesn't warrant or make any guarantees about the feedback provided, including its accuracy, relevance, or reliability. All feedback is subject to the terms set forth in the TetraConnect Hub Community Guidelines.
