Data Lakehouse Architecture
The TetraScience Data Lakehouse is designed to help life sciences organizations unlock opportunities for innovation and efficiency in drug discovery and development. The new data storage and management architecture provides 50% to 300% faster SQL query performance than the legacy Tetra Data Lake and an AI/ML-ready data storage format that operates seamlessly across all major data and cloud platform vendors.
What is a Data Lakehouse?
A data lakehouse is an open data management architecture that combines the benefits of both data lakes (cost-efficiency and scale) and data warehouses (management and transactions) to enable analytics and AI/ML on all data. It is a highly scalable and performant data storage architecture that breaks data silos and allows seamless, secure data access to authorized users.
TetraScience has adopted the ubiquitous Delta storage format to transform RAW data into refined, cleaned, and harmonized data, while empowering customers to create aggregated datasets as needed. This process is referred to as the “Medallion” architecture, which is outlined in the Databricks documentation.
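As a conceptual illustration of the Medallion pattern only, the following generic SQL sketch shows data moving from a raw layer to a cleaned, harmonized layer and then to an aggregated layer. The view names, column names, and the use of SQL views here are assumptions for illustration, not specific TDP features or schemas.

```sql
-- Medallion-style layering sketch (all names are illustrative placeholders).
-- Cleaned, harmonized layer derived from a raw source table.
CREATE VIEW example_org__tss__hplc.injections_clean AS
SELECT injection_id, sample_name, CAST(run_time AS TIMESTAMP) AS run_time
FROM example_org__tss__hplc.injections_raw
WHERE sample_name IS NOT NULL;

-- Aggregated layer built on top of the cleaned layer for a specific use case.
CREATE VIEW example_org__tss__hplc.runs_per_day AS
SELECT DATE(run_time) AS run_date, COUNT(*) AS injection_count
FROM example_org__tss__hplc.injections_clean
GROUP BY DATE(run_time);
```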
Benefits
The new Data Lakehouse Architecture provides the following benefits:
- Share data directly with any Databricks and Snowflake account
- Run SQL queries faster and more efficiently
- Create AI/ML-ready datasets while reducing data preparation time
- Reduce data storage costs
- Configure Tetraflow pipelines to read multiple data sources and run at specific times
- Reduce version changes for SQL tables
To start getting the performance benefits of the new Lakehouse architecture, customers can convert their historical Tetra Data to Lakehouse tables. These data backfills can be scoped by specific Intermediate Data Schemas (IDSs) and historical time ranges as needed. Customers can then transform Tetra Data in the Lakehouse for specific use cases by using a Tetraflow pipeline.
Architecture
The following diagram shows an example Data Lakehouse workflow in the Tetra Data Platform.
The diagram shows the following process:
- The TDP provides bidirectional integrations for instruments, Electronic Lab Notebooks (ELNs), Laboratory Information Management Systems (LIMS), connectivity middleware, and data science tools. It can also connect with contract research organizations (CROs) and contract development and manufacturing organizations (CDMOs). For a list of available Tetra Integrations, see Tetra Integrations.
- The TDP converts raw data into engineered Tetra Data by using the `raw-to-ids` protocol in a Tetra Data Pipeline and stores it in the Data Lakehouse.
- Tetra Data is converted to Lakehouse tables in an open-source Delta Lake format by using the `ids-to-lakehouse` protocol in another pipeline. Two types of Lakehouse tables are created:
  - IDS Lakehouse tables, which are the new nested-format Delta Tables. To access the new IDS Lakehouse tables, customers will eventually need to update their existing queries to align with the new table structure.
  - Normalized IDS Lakehouse tables, which transform IDS Lakehouse tables from the new, nested Delta Table structure to a normalized structure that mirrors the legacy Amazon Athena table structure in the Data Lakehouse. They reduce the need for rewriting existing queries and make it easier to point downstream analytics applications to Tetra Data in the Lakehouse.
- Two additional types of tables are created for each IDS Lakehouse table:
  - File Info Tables, which mirror the information provided in the legacy Athena `file_info_v1` tables.
  - File Attribute Tables, which mirror the information provided in the legacy Athena `file_attribute_v1` tables.
- Customers can then run Tetraflow pipelines to define and schedule data transformations in familiar SQL and generate custom, use case-specific Lakehouse tables that are optimized for downstream analytics applications, as shown in the sketch that follows this list.
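The following is an illustrative sketch of the kind of SQL transformation a Tetraflow pipeline might define and schedule. The schema, table, and column names (for example, `example_org__tss__chromatography.peaks`) are placeholders, not actual TDP or IDS objects, and the exact SQL statements supported by Tetraflow may differ.

```sql
-- Hypothetical use case-specific table built from an IDS Lakehouse table.
-- All schema, table, and column names below are illustrative placeholders.
CREATE OR REPLACE TABLE example_org__tss__chromatography.peak_summary AS
SELECT
  injection_id,
  MAX(peak_area)   AS max_peak_area,
  COUNT(peak_name) AS detected_peaks
FROM example_org__tss__chromatography.peaks
GROUP BY injection_id;
```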
Start Using the Data Lakehouse
To start using the Data Lakehouse, do the following:
- Convert your data into Lakehouse tables by using the `ids-to-lakehouse` protocol in a pipeline.
- Transform your data in the Lakehouse for specific use cases by using Tetraflow pipelines.
Lakehouse Data Access
You can transform and run SQL queries on data in the Lakehouse by using any of the following methods:
- A third-party tool and a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) driver endpoint
- The SQL Search page in the TDP user interface
- Databricks Delta Sharing
- Snowflake Data Sharing
For best practices on querying data in the Lakehouse, see Query Lakehouse Tables.
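For example, a minimal query like the following could be run from the SQL Search page or through a JDBC/ODBC connection from a third-party tool. The schema, table, and column names are placeholders; substitute your organization's actual Lakehouse table and column names.

```sql
-- Minimal illustrative query; replace the placeholder schema, table, and
-- column names with your organization's actual Lakehouse table names.
SELECT sample_id, measurement_value, created_at
FROM example_org__tss__balances.measurements
WHERE created_at >= DATE '2024-01-01'
LIMIT 100;
```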
IMPORTANT
SQL queries run against the Lakehouse tables generated from Cytiva AKTA IDS (`akta`) SQL tables aren't backwards compatible with the legacy Amazon Athena SQL table queries. When `akta` IDS SQL tables are converted into Lakehouse tables, the following table and column names are updated:
- The source `akta_v_run` Amazon Athena SQL table is replaced with an `akta_v_root` Lakehouse table.
- The `akta_v_run.time` column in the source Amazon Athena SQL tables is renamed to `akta_v_root.run_time`.
- The `akta_v_run.note` column in the source Amazon Athena SQL tables is renamed to `akta_v_root.run_note`.

These updated table and column names must be added to any queries run against the new `akta` IDS Lakehouse tables. A fix for this issue is in development and testing and is scheduled for a future release.
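For example, a legacy query might be updated as follows. The `example_org` org slug and the `example_org__tss__akta` schema name are placeholders based on the Lakehouse table name format described later on this page; only the `akta_v_run` to `akta_v_root` table rename and the `run_time` and `run_note` column renames are taken from the note above.

```sql
-- Legacy Amazon Athena query (before conversion); org slug is a placeholder.
SELECT time, note
FROM example_org.akta_v_run;

-- Equivalent query against the akta IDS Lakehouse table (after conversion).
SELECT run_time, run_note
FROM example_org__tss__akta.akta_v_root;
```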
Lakehouse Tables
When data is converted into Lakehouse tables, which use an open-source Delta Lake format, two types of Lakehouse tables are created:
- IDS Lakehouse tables, which are the new nested-format Delta Tables. To access the new IDS Lakehouse tables, customers will eventually need to update their existing queries to align with the new table structure.
- Normalized IDS Lakehouse tables, which transform IDS Lakehouse tables from the new, nested Delta Table structure to a normalized structure that mirrors the legacy Amazon Athena table structure in the Data Lakehouse. They reduce the need for rewriting existing queries and make it easier to point downstream analytics applications to Tetra Data in the Lakehouse.
Two additional types of tables are created to provide additional context on the files and metadata generated in your organization:
- File Info Tables, which mirror the information provided in the legacy Athena `file_info_v1` tables.
- File Attribute Tables, which mirror the information provided in the legacy Athena `file_attribute_v1` tables.
IMPORTANT
Lakehouse tables (Delta Tables) use slightly different table names and column names than the legacy SQL table structure used in Amazon Athena. For more information, see Table Names and Column Names.
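As an illustration only, a query like the following could combine an IDS Lakehouse table with its File Info table to add file context to query results. The join key and every schema, table, and column name shown are assumptions; check your organization's actual Lakehouse tables for the real names.

```sql
-- Hypothetical join between an IDS Lakehouse table and its File Info table.
-- All names, including the join key, are illustrative placeholders.
SELECT
  r.sample_id,
  r.result_value,
  fi.file_path
FROM example_org__tss__plate_reader.results AS r
JOIN example_org__tss__plate_reader.file_info AS fi
  ON r.file_id = fi.file_id;
```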
Table Names
To make it easier to share data across all major data and cloud platform vendors, Lakehouse table names use the following format.
Lakehouse Table Name Format:
{ORG_SLUG}__tss__{IDS_SLUG}.table_name
for tables generated using the Generally Available version of the Lakehouse (4.3.0+). A new top-level schema will be generated for each IDS ingested into the Lakehouse.
{ORG_SLUG}__tss__default.table_name
for tables generated in the Early Adopter Program period (TDP 4.2.x)
Legacy SQL tables in Amazon Athena remain unchanged and continue to use the following format.
Legacy SQL Table Name Format: `{ORG_SLUG}.table_name`
IMPORTANT
You must use the applicable Lakehouse table name format (`{ORG_SLUG}__tss__{IDS_SLUG}.table_name` or `{ORG_SLUG}__tss__default.table_name`), not the legacy format, to access and query Lakehouse tables.
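For example, with a placeholder org slug of `example_org`, a placeholder IDS slug of `plate_reader`, and a placeholder table name of `results`, fully qualified queries would look like the following. Substitute your organization's actual slugs and table names.

```sql
-- Lakehouse table created with the Generally Available Lakehouse (TDP 4.3.0+):
SELECT * FROM example_org__tss__plate_reader.results LIMIT 10;

-- Lakehouse table created during the Early Adopter Program period (TDP 4.2.x):
SELECT * FROM example_org__tss__default.results LIMIT 10;

-- Legacy Amazon Athena table name format (unchanged):
SELECT * FROM example_org.results LIMIT 10;
```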
Column Names
The Delta protocol uses different table and column naming conventions than the legacy SQL tables in Amazon Athena. Because these naming conventions are different, Lakehouse tables replace the following characters in their column names with an underscore (`_`):
- Accents
- Any characters that aren't part of the American Standard Code for Information Interchange (ASCII)
- The following special characters: `,;{}()\n\t=%/&`
- Column names that start with a number
IMPORTANT
For SQL queries that source from Lakehouse tables containing column names with any of the affected characters, you must change the column references in your queries to reflect the new underscored (`_`) column names.
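As a hypothetical example, a legacy Athena column named `ratio_260/280` contains a `/` character, so it would appear in the Lakehouse table with that character replaced by an underscore. The table and column names below are illustrative assumptions only.

```sql
-- Legacy Amazon Athena query referencing the original column name.
SELECT "ratio_260/280" FROM example_org.spectra_v1_results;

-- Equivalent Lakehouse query with the affected character replaced by "_".
SELECT ratio_260_280 FROM example_org__tss__spectra.results;
```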
View Lakehouse Tables
You can use the SQL Search page in the TDP to view and navigate between Lakehouse tables (Delta Tables), legacy SQL tables in Athena, and the system table database. For more information, see Query SQL Tables in the TDP.
Lakehouse Ingestion Data Validation
To help ensure data integrity, Lakehouse tables provide schema enforcement and data validation for files as they’re ingested into the Data Lakehouse. Any file ingestion failures are recorded on the Files Health Monitoring dashboard in the File Failures section. You can reprocess these failed files by running a Reconciliation Job.