Data Lakehouse Architecture

The TetraScience Data Lakehouse is designed to help life sciences organizations unlock opportunities for innovation and efficiency in drug discovery and development. The new data storage and management architecture provides 50% to 300% faster SQL query performance than the legacy Tetra Data Lake and an AI/ML-ready data storage format that operates seamlessly across all major data and cloud platform vendors.

What is a Data Lakehouse?

A data lakehouse is an open data management architecture that combines the benefits of both data lakes (cost-efficiency and scale) and data warehouses (management and transactions) to enable analytics and AI/ML on all data. It is a highly scalable and performant data storage architecture that breaks data silos and allows seamless, secure data access to authorized users.

TetraScience has adopted the ubiquitous Delta storage format to transform RAW data into refined, cleaned, and harmonized data, while empowering customers to create aggregated datasets as needed. This process is referred to as the “Medallion” architecture, which is outlined in the Databricks documentation.

Benefits

The new Data Lakehouse Architecture provides the following benefits:

  • Share data directly with any Databricks and Snowflake account

  • Run SQL queries faster and more efficiently

  • Create AI/ML-ready datasets while reducing data preparation time

  • Reduce data storage costs

  • Configure Tetraflow pipelines to read multiple data sources and run at specific times

  • Reduce version changes for SQL tables

To start getting the performance benefits of the new Lakehouse architecture, customers can convert their historical Tetra Data to Lakehouse tables. These data backfills can be scoped by specific Intermediate Data Schemas (IDSs) and historical time ranges as needed. Customers can then Transform Tetra Data in the Lakehouse for specific use cases by using a Tetraflow pipeline.

Architecture

The following diagram shows an example Data Lakehouse workflow in the Tetra Data Platform.

Data Lakehouse architecture diagram

The diagram shows the following process:

  1. The TDP provides bidirectional integrations for instruments, Electronic Lab Notebooks (ELNs), Laboratory Information Management Systems (LIMS), connectivity middleware, and data science tools. It can also connect with contract research organizations (CROs) and contract development and manufacturing organizations (CDMOs). For a list of available Tetra Integrations, see Tetra Integrations.
  2. The TDP converts raw data into engineered Tetra Data by using the raw-to-ids protocol in a Tetra Data Pipeline and stores it in the Data Lakehouse.
  3. Tetra Data is converted to Lakehouse Tables in an open-source Delta Lake format by using the ids-to-lakehouse protocol in another pipeline. Two types of Lakehouse tables are created:
    • IDS Lakehouse tables, which are the new nested-format Delta Tables. To access the new IDS Lakehouse tables, customers will eventually need to update their existing queries to align with the new table structure.
    • Normalized IDS Lakehouse tables, which transform IDS Lakehouse tables from the new, nested Delta Table structure to a normalized structure that mirrors the legacy Amazon Athena table structure in the Data Lakehouse. They reduce the need for rewriting existing queries and make it easier to point downstream analytics applications to Tetra Data in the Lakehouse.
  4. Two additional types of tables are created for each IDS Lakehouse Table:
  5. Customers can then run Tetraflow pipelines to define and schedule data transformations using familiar SQL and generate custom, use case-specific Lakehouse tables that are optimized for downstream analytics applications.
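To illustrate the relationship between the two table types in step 3, the sketch below flattens a nested record into the flat rows a normalized table mirrors. This is a schematic illustration only; the field names are invented and do not reflect the actual IDS schema or the ids-to-lakehouse implementation.

```python
# Schematic sketch: how a nested IDS record (as stored in an IDS Lakehouse
# table) relates to rows in a normalized IDS Lakehouse table. Field names
# are illustrative, not the actual IDS schema.

def flatten_results(record):
    """Explode a nested 'results' array into flat rows, one per result,
    carrying the parent identifiers -- the shape a normalized table mirrors."""
    rows = []
    for result in record["results"]:
        rows.append({
            "file_id": record["file_id"],
            "sample_name": record["sample"]["name"],
            "measurement": result["measurement"],
            "value": result["value"],
        })
    return rows

nested = {
    "file_id": "f-001",
    "sample": {"name": "S1"},
    "results": [
        {"measurement": "pH", "value": 7.2},
        {"measurement": "conductivity", "value": 1.4},
    ],
}

rows = flatten_results(nested)
# Each nested result becomes its own flat row.
```

The normalized form trades the self-contained nested structure for rows that legacy Athena-style queries can consume without rewrites.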

Start Using the Data Lakehouse

To start using the Data Lakehouse, do the following:

  1. Convert your data into Lakehouse tables by using the ids-to-lakehouse protocol in a pipeline.
  2. Transform your data in the Lakehouse for specific use cases by using Tetraflow pipelines.

Lakehouse Data Access

You can transform and run SQL queries on data in the Lakehouse by using any of the following methods:

For best practices on querying data in the Lakehouse, see Query Lakehouse Tables.

🚧

IMPORTANT

SQL queries run against Lakehouse tables generated from the Cytiva AKTA IDS (akta) aren't backward compatible with queries written for the legacy Amazon Athena SQL tables. When akta IDS SQL tables are converted into Lakehouse tables, the following table and column names change:

  • The source akta_v_run Amazon Athena SQL table is replaced with an akta_v_root Lakehouse table.

  • The akta_v_run.time column in the source Amazon Athena SQL tables is renamed to akta_v_root.run_time.

  • The akta_v_run.note column in the source Amazon Athena SQL tables is renamed to akta_v_root.run_note.

Any queries run against the new akta IDS Lakehouse tables must use these updated table and column names. A fix for this incompatibility is in development and testing and is scheduled for a future release.
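The renames above can be captured as a simple mapping. The sketch below applies them with a naive textual substitution; queries that use aliases, quoting, or other references to these names may still need manual review.

```python
# Known renames when akta Athena SQL tables are converted to Lakehouse tables
# (from the compatibility note above). A naive find-and-replace sketch --
# not a full SQL rewriter.
AKTA_RENAMES = {
    "akta_v_run.time": "akta_v_root.run_time",
    "akta_v_run.note": "akta_v_root.run_note",
    "akta_v_run": "akta_v_root",
}

def update_akta_query(sql: str) -> str:
    # Apply the longer, column-level renames first so the bare table-name
    # rename doesn't clobber them (dicts preserve insertion order).
    for old, new in AKTA_RENAMES.items():
        sql = sql.replace(old, new)
    return sql

legacy = "SELECT akta_v_run.time, akta_v_run.note FROM akta_v_run"
print(update_akta_query(legacy))
# SELECT akta_v_root.run_time, akta_v_root.run_note FROM akta_v_root
```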

Lakehouse Tables

When data is converted into Lakehouse tables, which use an open-source Delta Lake format, two types of Lakehouse tables are created:

  • IDS Lakehouse tables, which are the new nested-format Delta Tables. To access the new IDS Lakehouse tables, customers will eventually need to update their existing queries to align with the new table structure.
  • Normalized IDS Lakehouse tables, which transform IDS Lakehouse tables from the new, nested Delta Table structure to a normalized structure that mirrors the legacy Amazon Athena table structure in the Data Lakehouse. They reduce the need for rewriting existing queries and make it easier to point downstream analytics applications to Tetra Data in the Lakehouse.

Two additional types of tables are created to provide additional context on the files and metadata generated in your organization:

🚧

IMPORTANT

Lakehouse tables (Delta Tables) use slightly different table names and column names than the legacy SQL table structure used in Amazon Athena. For more information, see Table Names and Column Names.

Table Names

To make it easier to share data across all major data and cloud platform vendors, Lakehouse table names use the following format.

Lakehouse Table Name Format:

{ORG_SLUG}__tss__{IDS_SLUG}.table_name for tables generated using the Generally Available version of the Lakehouse (4.3.0+). A new top-level schema will be generated for each IDS ingested into the Lakehouse.

{ORG_SLUG}__tss__default.table_name for tables generated in the Early Adopter Program period (TDP 4.2.x)

Legacy SQL tables in Amazon Athena remain unchanged and continue to use the following format.

Legacy SQL Table Name Format: {ORG_SLUG}.table_name
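The naming formats above can be sketched as simple helpers. The slug and table values here are illustrative placeholders, not real identifiers.

```python
# Sketch of the Lakehouse table-name formats described above.
# ORG_SLUG, IDS_SLUG, and table values are illustrative.

def lakehouse_table_name(org_slug: str, ids_slug: str, table: str) -> str:
    """GA format (TDP 4.3.0+): one top-level schema per ingested IDS."""
    return f"{org_slug}__tss__{ids_slug}.{table}"

def early_adopter_table_name(org_slug: str, table: str) -> str:
    """Early Adopter Program format (TDP 4.2.x): shared 'default' schema."""
    return f"{org_slug}__tss__default.{table}"

def legacy_athena_table_name(org_slug: str, table: str) -> str:
    """Legacy Amazon Athena format (unchanged)."""
    return f"{org_slug}.{table}"

print(lakehouse_table_name("myorg", "akta", "akta_v_root"))
# myorg__tss__akta.akta_v_root
```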

🚧

IMPORTANT

You must use the {ORG_SLUG}__tss__{IDS_SLUG}.table_name format (or {ORG_SLUG}__tss__default.table_name for tables created during the Early Adopter Program period) to access and query Lakehouse tables. The legacy {ORG_SLUG}.table_name format applies to Amazon Athena tables only.

Column Names

The Delta protocol uses different table and column naming conventions than the legacy SQL tables in Amazon Athena. Because these naming conventions are different, Lakehouse tables replace the following characters in their column names with an underscore (_):

🚧

IMPORTANT

If a SQL query sources from a Lakehouse table whose column names contained any of the affected characters, you must update the column references in that query to use the new underscored (_) column names.
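The substitution can be sketched as below. Note that the character set here is an assumption (the characters the Delta protocol disallows in column names, such as spaces, commas, semicolons, braces, parentheses, tabs, newlines, and equals signs); confirm the exact list against the table in this section before relying on it.

```python
# Sketch of the underscore substitution described above. The character set
# is an ASSUMPTION based on Delta protocol column-name restrictions, not the
# product's confirmed list -- verify against the documentation.
DISALLOWED = set(" ,;{}()\n\t=")

def lakehouse_column_name(name: str) -> str:
    """Replace each disallowed character with an underscore (_)."""
    return "".join("_" if ch in DISALLOWED else ch for ch in name)

print(lakehouse_column_name("peak area (mAU)"))
# peak_area__mAU_
```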

View Lakehouse Tables

You can use the SQL Search page in the TDP to view and navigate between Lakehouse tables (Delta Tables), legacy SQL tables in Athena, and the system table database. For more information, see Query SQL Tables in the TDP.

Lakehouse Ingestion Data Validation

To help ensure data integrity, Lakehouse tables provide schema enforcement and data validation for files as they’re ingested into the Data Lakehouse. Any file ingestion failures are recorded on the Files Health Monitoring dashboard in the File Failures section. You can reprocess these failed files by running a Reconciliation Job.