Data Lakehouse Architecture (EAP)
The Data Lakehouse Architecture early adopter program (EAP) is designed to help life sciences organizations unlock opportunities for innovation and efficiency in drug discovery and development. The new data storage and management architecture provides up to 10 times faster SQL query performance and an AI/ML-ready data storage format that operates seamlessly across all major data and cloud platform vendors.
The new Data Lakehouse Architecture provides the following benefits:
- Fast, efficient SQL querying
- Ability to create AI/ML-ready datasets with less data preparation time than before
- Reduced data storage costs
- Ability to configure Tetra Data Pipelines to read multiple data sources and run at specific times
- Reduced version changes for SQL tables
To help customers get the performance benefits of the new Lakehouse architecture faster, TetraScience account teams will work with each customer to convert their historical Tetra Data to Lakehouse tables. These data backfills can be scoped by specific Intermediate Data Schemas (IDSs) and historical time ranges as needed. Customers can then Transform Tetra Data in the Lakehouse for specific use cases by using a Tetraflow pipeline.
NOTE
The Data Lakehouse Architecture is available to all customers as part of an early adopter program (EAP) and will continue to be updated in future TDP releases. If you are interested in participating in the early adopter program, please contact your customer success manager (CSM).
What is a Data Lakehouse?
A data lakehouse is an open data management architecture that combines the benefits of both data lakes (cost-efficiency and scale) and data warehouses (management and transactions) to enable analytics and AI/ML on all data. It is a highly scalable and performant data storage architecture that breaks data silos and allows seamless, secure data access to authorized users.
TetraScience is adopting the ubiquitous Delta storage format to transform RAW data into refined, cleaned, and harmonized data, while empowering customers to create aggregated datasets as needed. This process is referred to as the “Medallion” architecture, which is outlined in the Databricks documentation.
Architecture
The following diagram shows an example Data Lakehouse workflow in the Tetra Data Platform.
The diagram shows the following process:
- The TDP provides bidirectional integrations for instruments, Electronic Lab Notebooks (ELNs), Laboratory Information Management Systems (LIMS), connectivity middleware, and data science tools. It can also connect with contract research organizations (CROs) and contract development and manufacturing organizations (CDMOs). For a list of available Tetra Integrations, see Tetra Integrations.
- The TDP converts raw data into engineered Tetra Data by using the `raw-to-ids` protocol in a Tetra Data Pipeline and stores it in the Data Lakehouse.
- Tetra Data is converted to IDS Delta Tables in an open-source Delta Lake format.
- Two additional types of tables are created for each IDS Delta Table:
  - File Info Tables mirror the information provided in the legacy Athena `file_info_v1` tables.
  - File Attribute Tables mirror the information provided in the legacy Athena `file_attribute_v1` tables.
- Customers can then run Tetraflow pipelines to define and schedule data transformations in familiar SQL and generate custom, use case-specific Lakehouse tables that are optimized for downstream analytics applications, as sketched in the example below.
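As a rough illustration, the query below sketches the kind of SQL transformation logic such a pipeline might run to build a use case-specific table. It assumes a hypothetical org slug (`acme`) and hypothetical `sample_name` and `peak_area` columns, and it doesn't show how the Tetraflow pipeline itself is defined or where its output table is written.

```sql
-- Illustrative sketch only: 'acme' is a hypothetical org slug, and
-- sample_name and peak_area are hypothetical IDS fields.
SELECT
    file_path,    -- unique record identifier (useful for deduplication later)
    created_at,   -- latest record version (useful for deduplication later)
    sample_name,  -- hypothetical sample-level field
    peak_area     -- hypothetical measurement field
FROM acme__tss__default.empower
WHERE peak_area IS NOT NULL
```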
Start Using the Data Lakehouse
To start using the Data Lakehouse, do the following:
- Contact your customer success manager (CSM). They'll help you do the following:
- Activate a Steady State process for translating any new data from target IDS versions into Lakehouse tables (IDS Delta Tables, File Info Tables, and File Attribute Tables). These tables mirror the legacy Amazon Athena table structure.
- Backfill your historical Tetra Data from target IDS versions into Lakehouse tables.
- Transform your Tetra Data in the Lakehouse for specific use cases.
Lakehouse Data Access
You can transform and run SQL queries on Tetra Data in the Lakehouse by using any of the following methods:
- A third-party tool and a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) driver endpoint
- The SQL Search page in the TDP user interface
- Databricks Delta Sharing
- Snowflake Data Sharing
View Lakehouse Tables
You can use the SQL Search page in the TDP to view and navigate between Lakehouse tables (Delta Tables), legacy SQL tables in Athena, and the system table database. For more information, see Query SQL Tables in the TDP.
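For example, assuming the page accepts a standard `SHOW TABLES` statement (the exact syntax depends on the underlying query engine) and a hypothetical org slug of `acme`, you could list the Lakehouse tables like this:

```sql
-- List the Lakehouse (Delta) tables for a hypothetical 'acme' org slug;
-- exact SHOW TABLES syntax may vary by query engine.
SHOW TABLES IN acme__tss__default
```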
Important Considerations When Accessing Data in the Lakehouse
Lakehouse tables (Delta Tables) use slightly different table names and column names than the legacy SQL table structure used in Amazon Athena. For the initial early adopter program, it's also recommended that you apply data deduplication best practices. Applying these patterns ensures that you use the latest records in downstream datasets and retrieve the most current data in SQL query results, whether you're running SQL queries on the new tables or setting up a Tetraflow pipeline.
Table Names
To make it easier to share data across all major data and cloud platform vendors, Lakehouse table names use the following format.
Lakehouse Table Name Format: {ORG_SLUG}__tss__default.table_name
Legacy SQL tables in Amazon Athena remain unchanged and continue to use the following format.
Legacy SQL Table Name Format: {ORG_SLUG}.table_name
IMPORTANT
You must use the `{ORG_SLUG}__tss__default.table_name` table name format to access and query Lakehouse tables.
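For example, with a hypothetical org slug of `acme` and the `empower` table used in the deduplication example later on this page, the two formats compare as follows:

```sql
-- Lakehouse (Delta) table: note the __tss__default suffix on the org slug
SELECT * FROM acme__tss__default.empower LIMIT 10;

-- Legacy Athena table: format unchanged
SELECT * FROM acme.empower LIMIT 10;
```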
Column Names
The Delta protocol uses different table and column naming conventions than the legacy SQL tables in Amazon Athena. Because these naming conventions differ, Lakehouse tables replace the following characters in their column names with an underscore (`_`):
- Accents
- Any characters that aren't part of the American Standard Code for Information Interchange (ASCII)
- The following special characters: `,;{}()\n\t=%/&`
- Column names that start with a number
IMPORTANT
For SQL queries that source from Lakehouse tables with column names that contain any of the affected characters, you must update the column references in your queries to use the new underscored (`_`) column names.
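As a hypothetical illustration, a legacy column named `result%value` would appear as `result_value` in the Lakehouse table because `%` is one of the affected characters, so a query against the Lakehouse table must reference the underscored name:

```sql
-- Hypothetical example: the legacy column 'result%value' is exposed as
-- 'result_value' in the Lakehouse table, so reference the underscored name.
SELECT file_path, result_value
FROM acme__tss__default.empower
```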
Data Deduplication
Lakehouse tables (Delta Tables) are written in "append" mode, which means new data is added to the tables without overwriting the previous data. This behavior creates multiple copies of each row (record).
To deduplicate the data, any downstream aggregate tables produced by a Tetraflow Pipeline must include the following columns:
- A column that defines the latest record version (either a timestamp or a sequential, unique identifier)
- A column that defines a sequential, unique record identifier (index key) for each record
With this setup, you can then group your aggregate tables and SQL queries by the latest record versions for each row (record) in the table.
Data Deduplication Example
In a legacy SQL table in Amazon Athena, the `file_path` column serves as a record identifier, because each record is identified by a unique `file_path`. Each record also has a `created_at` column that determines the latest record version.
Here's an example. For the following table called `empower`:
| file_path | col 1 | col 2 | ... | created_at |
|---|---|---|---|---|
| path1 | | | | 2024-08-27 16:03:14 |
| path2 | | | | 2024-08-27 16:03:12 |
| path1 | | | | 2024-08-27 16:04:14 |
| path2 | | | | 2024-08-27 16:07:12 |
The following query returns a deduplicated dataset with only the latest rows (records):
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY file_path ORDER BY created_at DESC, file_path ASC) as row_num
FROM empower
)
WHERE row_num = 1
Example Response Table
| file_path | col 1 | col 2 | ... | created_at | row_num |
|---|---|---|---|---|---|
| path1 | | | | 2024-08-27 16:04:14 | 1 |
| path2 | | | | 2024-08-27 16:07:12 | 1 |
IMPORTANT
SQL queries that use window and partition functions are more expensive to run. Make sure that you use these functions only as needed.
Lakehouse Ingestion Data Validation
To help ensure data integrity, Lakehouse tables provide schema enforcement and data validation for files as they’re ingested into the Data Lakehouse. Any file ingestion failures are recorded on the Files Health Monitoring dashboard in the File Failures section. You can reprocess these failed files by running a Reconciliation Job.
IMPORTANT
If an IDS file has a field that ends with a backslash (`\`) escape character, the file can't be uploaded to the Data Lakehouse. These file ingestion failures happen because escape characters currently trigger the Lakehouse data ingestion rules. The errors are recorded on the Files Health Monitoring dashboard in the File Failures section. There is currently no way to reconcile these file failures. A fix for this issue is in development and testing and is scheduled for a future TDP release.