Query Lakehouse Tables

This guide explains how to effectively query nested data structures within Lakehouse tables.

For information about the different ways you can access your Lakehouse tables and apply these best practices, see Lakehouse Data Access.

Best Practices For Querying Lakehouse Tables

When querying Lakehouse tables, it's recommended that you do the following:

  • Use aliases to improve readability when working with deeply nested structures.
  • Use UNNEST for array processing when you need to work with individual array elements.
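For example, aliasing the table keeps deeply nested paths short and readable. The following sketch reuses the table and columns from the examples later in this guide; the alias name `t` is illustrative:

```sql
-- Alias the table so nested paths stay compact and unambiguous.
SELECT
  t.file_metadata.file_id,
  t.project.name AS project_name
FROM lcuv_empower_v17 AS t
LIMIT 10;
```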

For more best practices on querying hierarchical data, see Use CROSS JOIN and UNNEST in the Amazon Athena User Guide.

Working With Nested Data

Lakehouse tables contain columns that may represent complex nested structures, including the following:

  • Objects: collections of key-value pairs
  • Arrays: ordered lists of values
  • Combinations of objects and arrays at multiple levels of nesting
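As a sketch of how each kind of structure is accessed, the following query uses column names that appear in the examples later in this guide (the exact schema of your table may differ):

```sql
-- Hypothetical sketch: accessing each kind of nested structure.
SELECT
  file_metadata.file_id,   -- object field (key-value pair)
  runs[1].injection.id,    -- object inside the first element of an array
  results                  -- array that itself combines objects and arrays
FROM lcuv_empower_v17
LIMIT 5;
```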

Basic Dot Notation Access

Accessing Object Fields

To access fields within object columns, use dot notation. To navigate through multiple levels of nested objects, chain the dot notation, as shown for the injection_id alias in the following example.

Dot Notation Query Example
```sql
SELECT
  file_metadata.file_id,
  runs[1].injection.id AS injection_id,
  systems[1].name AS system_name,
  project.name AS project
FROM
  lcuv_empower_v17
LIMIT 10;
```

Results

| file_id | injection_id | system_name | project |
| --- | --- | --- | --- |
| ac7c32e3-0593 | 1937 | System A | Site_Working_Projects_X |
| b6c244b1-8799 | 1931 | System B | Site_Working_Projects_Y |
| 332f4d40-72af | 2033 | System A | Site_Working_Projects_Z |

Querying Array Elements

Array Unnesting with UNNEST

To transform array elements into rows, use the UNNEST function. UNNEST expands nested values into individual rows, so you can work with each element directly and see what your end-state data will look like for your analytics use case.

UNNEST Function Example Query

The following query demonstrates using the UNNEST function to analyze peak data for the Empower IDS.

```sql
SELECT
  file_metadata.file_id,
  rp_element.analyte AS analyte,
  rp_element.area.value AS area_value,
  rp_element.area.percent.value AS area_percent,
  rpw_element.value AS peak_width
FROM lcuv_empower_v17
CROSS JOIN UNNEST(results) AS t1(result_element)
CROSS JOIN UNNEST(result_element.peaks) AS t2(rp_element)
CROSS JOIN UNNEST(rp_element.widths) AS t3(rpw_element);
```

Results

| file_id | analyte | area_value | area_percent | peak_width |
| --- | --- | --- | --- | --- |
| 6cfec553 | Lithium | 26527.0 | 16.17 | 0.3272604 |
| 6cfec553 | Sodium | 37879.0 | 23.09 | 0.59 |
| 6cfec553 | Sodium | 37879.0 | 23.09 | 0.313613 |

Working with Normalized Lakehouse Tables

When configuring ids-to-lakehouse pipelines, you can choose to create normalized Lakehouse tables (Normalized IDS and Normalized Datacubes tables). These tables transform the new, nested Delta table structure into a normalized structure that mirrors the legacy Amazon Athena table structure in the Data Lakehouse, reducing the need to rewrite existing queries that use the legacy table structure.

The normalized IDS Lakehouse tables are structured exactly the same as the legacy Amazon Athena SQL Tables.

The normalized datacubes Lakehouse tables differ slightly from the legacy Amazon Athena datacubes tables. For more information, see Datacubes Lakehouse Tables Best Practices.

Datacubes Lakehouse Tables Best Practices

The updated datacubes Lakehouse table schema streamlines data access, improves query performance, and provides a more robust data model for managing data at scale.

The datacubes Lakehouse table schema removes the more complex file_metadata struct column that's part of the original <ids_name>_datacubes tables from the GA version of the Lakehouse in favor of four new, top-level columns:

  • file_metadata_file_id (STRING): A unique identifier for each file.
  • file_metadata_ids_path (STRING): The path of the data source.
  • file_metadata_created_at_timestamp (BIGINT): The creation timestamp of the file.
  • file_metadata_created_at (TIMESTAMP): The creation time of the file.

This change is designed to improve data access efficiency by reducing the need for costly joins and nested field lookups.

The original file_metadata column will be retained for backward compatibility, but will no longer be populated with data for new rows. Instead, it will have a NULL value for all newly inserted records. For tables created with this new release, the file_metadata column will always be NULL.

For best performance, customers should update their queries to use one or more of the new top-level columns in their WHERE or ON clauses. For example, instead of querying on t.file_metadata.file_id, use t.file_metadata_file_id.
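For example, a filter on the nested field can be rewritten to use the top-level column instead. The table name below is illustrative, and the file_id value is a placeholder:

```sql
-- Before: nested field lookup (file_metadata is NULL for newly inserted rows)
-- SELECT * FROM my_datacubes_table t
-- WHERE t.file_metadata.file_id = '<file_id>';

-- After: filter on the top-level column, avoiding the nested lookup
SELECT *
FROM my_datacubes_table t
WHERE t.file_metadata_file_id = '<file_id>';
```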

Documentation Feedback

Do you have questions about our documentation or suggestions for how we can improve it? Start a discussion in TetraConnect Hub. For access, see Access the TetraConnect Hub.

📘 NOTE

Feedback isn't part of the official TetraScience product documentation. TetraScience doesn't warrant or make any guarantees about the feedback provided, including its accuracy, relevance, or reliability. All feedback is subject to the terms set forth in the TetraConnect Hub Community Guidelines.