Data Lakehouse Architecture (EAP)

The Data Lakehouse Architecture early adopter program (EAP) is designed to help life sciences organizations unlock opportunities for innovation and efficiency in drug discovery and development. The new data storage and management architecture provides up to 10 times faster SQL query performance and an AI/ML-ready data storage format that operates seamlessly across all major data and cloud platform vendors.

The new Data Lakehouse Architecture provides the following benefits:

  • Fast, efficient SQL querying
  • Ability to create AI/ML-ready datasets while reducing data preparation time
  • Reduced data storage costs
  • Ability to configure Tetra Data Pipelines to read multiple data sources and run at specific times
  • Reduced version changes for SQL tables

To help customers get the performance benefits of the new Lakehouse architecture faster, TetraScience account teams will work with each customer to convert their historical Tetra Data to Lakehouse tables. These data backfills can be scoped by specific Intermediate Data Schemas (IDSs) and historical time ranges as needed. Customers can then Transform Tetra Data in the Lakehouse for specific use cases by using a Tetraflow pipeline.

📘

NOTE

The Data Lakehouse Architecture is available to all customers as part of an early adopter program (EAP) and will continue to be updated in future TDP releases. If you are interested in participating in the early adopter program, please contact your customer success manager (CSM).

What is a Data Lakehouse?

A data lakehouse is an open data management architecture that combines the benefits of both data lakes (cost-efficiency and scale) and data warehouses (management and transactions) to enable analytics and AI/ML on all data. It is a highly scalable and performant data storage architecture that breaks data silos and allows seamless, secure data access to authorized users.

TetraScience is adopting the ubiquitous Delta storage format to transform RAW data into refined, cleaned, and harmonized data, while empowering customers to create aggregated datasets as needed. This process is referred to as the “Medallion” architecture, which is outlined in the Databricks documentation.

Architecture

The following diagram shows an example Data Lakehouse workflow in the Tetra Data Platform.

Data Lakehouse architecture diagram

The diagram shows the following process:

  1. The TDP provides bidirectional integrations for instruments, Electronic Lab Notebooks (ELNs), Laboratory Information Management Systems (LIMS), connectivity middleware, and data science tools. It can also connect with contract research organizations (CROs) and contract development and manufacturing organizations (CDMOs). For a list of available Tetra Integrations, see Tetra Integrations.
  2. The TDP converts raw data into engineered Tetra Data by using the raw-to-ids protocol in a Tetra Data Pipeline and stores it in the Data Lakehouse.
  3. Tetra Data is converted to IDS Delta Tables in an open-source Delta Lake format.
  4. Two additional types of tables are created for each IDS Delta Table:
  5. Customers can then run Tetraflow pipelines to define and schedule data transformations in a familiar SQL language and generate custom, use case-specific Lakehouse tables that are optimized for downstream analytics applications.

Start Using the Data Lakehouse

To start using the Data Lakehouse, do the following:

  1. Contact your customer success manager (CSM). They'll help you do the following:
  2. Transform your Tetra Data in the Lakehouse for specific use cases.

Lakehouse Data Access

You can transform and run SQL queries on Tetra Data in the Lakehouse by using any of the following methods:

View Lakehouse Tables

You can use the SQL Search page in the TDP to view and navigate between Lakehouse tables (Delta Tables), legacy SQL tables in Athena, and the system table database. For more information, see Query SQL Tables in the TDP.

Important Considerations When Accessing Data in the Lakehouse

Lakehouse tables (Delta Tables) use slightly different table names and column names than the legacy SQL table structure used in Amazon Athena. For the initial early adopter program, it's also recommended that you apply data deduplication best practices. Applying these patterns ensures that you're using the latest records in downstream datasets and retrieving the most current data in SQL query results when either running SQL queries on the new tables or setting up a Tetraflow pipeline.

Table Names

To make it easier to share data across all major data and cloud platform vendors, Lakehouse table names use the following format.

Lakehouse Table Name Format: {ORG_SLUG}__tss__default.table_name

Legacy SQL tables in Amazon Athena remain unchanged and continue to use the following format.

Legacy SQL Table Name Format: {ORG_SLUG}.table_name
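To make the difference concrete, here is a minimal sketch of the two name formats, assuming a hypothetical org slug `acme` and a hypothetical table named `lcms_results`:

```python
# Hypothetical org slug and table name, for illustration only.
org_slug = "acme"
table = "lcms_results"

# Lakehouse (Delta) table name format
lakehouse_name = f"{org_slug}__tss__default.{table}"

# Legacy Athena table name format
legacy_name = f"{org_slug}.{table}"

print(lakehouse_name)  # acme__tss__default.lcms_results
print(legacy_name)     # acme.lcms_results
```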

🚧

IMPORTANT

You must use the {ORG_SLUG}__tss__default.table_name table name format to access and query Lakehouse tables.

Column Names

The Delta protocol uses different table and column naming conventions than the legacy SQL tables in Amazon Athena. Because these naming conventions are different, Lakehouse tables replace the following characters in their column names with an underscore (_):
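As an illustration of the renaming behavior, the sketch below replaces affected characters with an underscore. The character set used here (spaces, periods, commas, and semicolons) is an assumption for demonstration purposes; refer to the list above for the actual characters that Lakehouse tables replace.

```python
import re

# Assumed example character set, for illustration only; the authoritative
# list of affected characters is defined by the Lakehouse naming rules.
AFFECTED = r"[ .,;]"

def to_lakehouse_column(name: str) -> str:
    """Replace affected characters in a column name with an underscore (_)."""
    return re.sub(AFFECTED, "_", name)

print(to_lakehouse_column("peak area"))  # peak_area
```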

🚧

IMPORTANT

For SQL queries that read from Lakehouse tables whose column names contain any of the affected characters, you must update the column references in your queries to use the new underscored (_) column names.

Data Deduplication

Lakehouse tables (Delta Tables) are written in "append" mode, which means new data is added to the tables without overwriting the previous data. This behavior creates multiple copies of each row (record).

To deduplicate the data, any downstream aggregate tables produced by a Tetraflow Pipeline must include the following columns:

  1. A column that defines the latest record version (either a timestamp or a sequential, unique identifier)
  2. A column that defines a sequential, unique record identifier (index key) for each record

With this setup, you can then group your aggregate tables and SQL queries by the latest record versions for each row (record) in the table.

Data Deduplication Example

In a legacy SQL table in Amazon Athena, the file_path column serves as a record identifier, because each record is identified by a unique file_path. Each record also has a created_at column that determines the latest record version.

Here's an example:

For the following table called empower:

| file_path | col 1 | col 2 | ... | created_at          |
|-----------|-------|-------|-----|---------------------|
| path1     |       |       |     | 2024-08-27 16:03:14 |
| path2     |       |       |     | 2024-08-27 16:03:12 |
| path1     |       |       |     | 2024-08-27 16:04:14 |
| path2     |       |       |     | 2024-08-27 16:07:12 |

The following query returns a deduplicated dataset with only the latest rows (records):

SELECT *
FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY file_path ORDER BY created_at DESC, file_path ASC) AS row_num
    FROM empower
)
WHERE row_num = 1

Example Response Table

| file_path | col 1 | col 2 | ... | created_at          | row_num |
|-----------|-------|-------|-----|---------------------|---------|
| path1     |       |       |     | 2024-08-27 16:04:14 | 1       |
| path2     |       |       |     | 2024-08-27 16:07:12 | 1       |
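You can reproduce this deduplication pattern locally. The following sketch uses Python's built-in sqlite3 module (which supports window functions in SQLite 3.25 and later) to build the sample empower table and run the same ROW_NUMBER query:

```python
import sqlite3

# In-memory database with the sample empower records shown above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE empower (file_path TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO empower VALUES (?, ?)",
    [
        ("path1", "2024-08-27 16:03:14"),
        ("path2", "2024-08-27 16:03:12"),
        ("path1", "2024-08-27 16:04:14"),
        ("path2", "2024-08-27 16:07:12"),
    ],
)

# Keep only the latest record for each file_path.
rows = conn.execute("""
    SELECT file_path, created_at
    FROM (
        SELECT *,
            ROW_NUMBER() OVER (
                PARTITION BY file_path
                ORDER BY created_at DESC, file_path ASC
            ) AS row_num
        FROM empower
    )
    WHERE row_num = 1
    ORDER BY file_path
""").fetchall()

print(rows)
# [('path1', '2024-08-27 16:04:14'), ('path2', '2024-08-27 16:07:12')]
```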

🚧

IMPORTANT

SQL queries that use window and partition functions are more expensive to run. Make sure that you use these functions only as needed.
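When the only goal is to find the latest created_at value for each record identifier, a plain GROUP BY aggregate can be a cheaper alternative to a window function. A sketch against the same sample table, assuming created_at timestamps sort correctly as strings:

```python
import sqlite3

# Same sample empower records as in the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE empower (file_path TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO empower VALUES (?, ?)",
    [
        ("path1", "2024-08-27 16:03:14"),
        ("path2", "2024-08-27 16:03:12"),
        ("path1", "2024-08-27 16:04:14"),
        ("path2", "2024-08-27 16:07:12"),
    ],
)

# Group by the record identifier and keep the latest version of each record.
rows = conn.execute("""
    SELECT file_path, MAX(created_at) AS created_at
    FROM empower
    GROUP BY file_path
    ORDER BY file_path
""").fetchall()

print(rows)
# [('path1', '2024-08-27 16:04:14'), ('path2', '2024-08-27 16:07:12')]
```

Note that this aggregate form only returns the grouped and aggregated columns; if you need every column from the latest row, the ROW_NUMBER pattern shown above is the more general approach.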

Lakehouse Ingestion Data Validation

To help ensure data integrity, Lakehouse tables provide schema enforcement and data validation for files as they’re ingested into the Data Lakehouse. Any file ingestion failures are recorded on the Files Health Monitoring dashboard in the File Failures section. You can reprocess these failed files by running a Reconciliation Job.

🚧

IMPORTANT

If an IDS file has a field that ends with a backslash \ escape character, the file can’t be uploaded to the Data Lakehouse. These file ingestion failures happen because escape characters currently trigger the Lakehouse data ingestion rules. The errors are recorded on the Files Health Monitoring dashboard in the File Failures section. There is no way to reconcile these file failures at this time. A fix for this issue is in development and testing and is scheduled for a future TDP release.