Tetra Data Pipelines

Tetra Data Pipelines automatically run a set of actions either on a schedule, or each time new data is ingested into the Tetra Scientific Data and AI Cloud. These pipelines run automated data operations and transformations that can help you create queryable, harmonized Tetra Data, and then enrich and push that data to downstream systems.

📘
NOTE
The Data Lakehouse Architecture early adopter program (EAP) introduced in TDP v4.1.0 uses a new Tetraflow Pipeline artifact that can be configured in a Tetra Data Pipeline instead of a protocol. Tetraflow pipelines run on infrastructure optimized for large-scale data processing and enable users to programmatically transform their Tetra Data in the Lakehouse into analytics-optimized datasets and aggregates.

Pipeline Use Case Examples

You can use Tetra Data Pipelines to do any of the following:

Harmonize data: Parse proprietary instrument output files into a vendor-neutral and scientifically relevant Intermediate Data Schema (IDS), while also storing the data in SQL tables.
Transform data: Add calculated fields to standard data fields.
Contextualize files: Add attributes to files to improve how retrievable they are by search. For example, you can use Tetra Data Pipelines to programmatically add information about samples, experiment names, and laboratories.
Enrich files: Get information from other files within the Tetra Scientific Data and AI Cloud to augment new data.
Convert file formats: Convert IDS JSON files to other formats, such as ADF or CSV.
Push data to third-party applications: Send data to an electronic lab notebook (ELN), laboratory information management system (LIMS), analytics application, or an AI/ML platform.
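
As a rough illustration of the "Transform data" and "Convert file formats" use cases, the following Python sketch adds a calculated field to a small JSON document and writes the result as CSV. The document layout and the field names (`results`, `sample_id`, `mass_mg`, `volume_ml`, `concentration_mg_per_ml`) are hypothetical and do not represent a real IDS. The sketch uses only the Python standard library, not a TetraScience API.

```python
import csv
import io
import json

# Hypothetical, simplified stand-in for an IDS JSON document.
SAMPLE_JSON = """
{
  "results": [
    {"sample_id": "S-001", "mass_mg": 10.0, "volume_ml": 2.0},
    {"sample_id": "S-002", "mass_mg": 7.5, "volume_ml": 3.0}
  ]
}
"""


def add_concentration(record):
    """Return a copy of the record with a calculated concentration field."""
    enriched = dict(record)
    enriched["concentration_mg_per_ml"] = record["mass_mg"] / record["volume_ml"]
    return enriched


def json_to_csv(json_text):
    """Convert the hypothetical JSON document to CSV text."""
    records = [add_concentration(r) for r in json.loads(json_text)["results"]]
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    print(json_to_csv(SAMPLE_JSON))
```

In a real pipeline, logic like this would live in a task script function and the input would come from the file that triggered the workflow.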

Get Started with Pipelines

To create a pipeline, see Set Up and Edit Pipelines. To manage pipelines or compare and restore different pipeline versions, see Manage Pipelines.

To extend the capabilities of Tetra Data Pipelines and the Tetra Data Platform (TDP), you can also configure your own custom pipeline logic by either using a Python script with the `python-exec` protocol or creating a self-service Tetra Data Pipeline (SSP). For more complex custom pipeline setups (scripts with 12,000 characters or more), we recommend that you create and manage your own SSP.

For more information and best practices, see Data Engineering and Tooling and Automation in the TetraConnect Hub. For access, see Access the TetraConnect Hub.

📘
Try Our New AI Assistants

Lab Data Automation Assistant: helps you quickly design, develop, test, and deploy your own lab data automation pipelines for selected ELNs using an AI-powered workflow

Visual Pipeline Builder: provides no-code and low-code tools to help you quickly build and edit pipelines

Tetra Data Pipeline Architecture

The following diagram shows an example Tetra Data Pipeline workflow:

Figure: Tetra Data Pipeline architecture diagram

The diagram shows the following workflow:

When an event occurs, such as a file being uploaded to a specific location, each active pipeline checks whether the event matches its configured trigger conditions.
If the event matches the pipeline's trigger condition, the pipeline's protocol runs a predefined workflow. Protocols define the business logic of your pipeline by specifying the steps and the functions within task scripts that run those steps.
When processing is finished, output files are then indexed according to a predefined IDS and stored in the Tetra Scientific Data and AI Cloud. Through this process, the data becomes easily accessible through search in the TDP user interface, TetraScience API, and SQL queries. You can also push data to third-party applications, such as an ELN, LIMS, analytics application, or an AI/ML platform.
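
The workflow above can be sketched as a small in-memory simulation. This is a minimal sketch for orientation only: the `Pipeline` and `DataStore` classes, the `handle_event` function, and the two example steps are hypothetical names invented for this example and are not part of the TDP or the TetraScience API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Pipeline:
    """Hypothetical in-memory model of a pipeline; not a TDP class."""
    name: str
    enabled: bool
    trigger: Callable[[Dict], bool]      # returns True if the event matches
    steps: List[Callable[[Dict], Dict]]  # each step transforms the payload


@dataclass
class DataStore:
    """Stand-in for indexed storage; it simply collects outputs."""
    indexed: List[Dict] = field(default_factory=list)


def parse_step(payload):
    """Example step: mark the payload as parsed."""
    return {**payload, "parsed": True}


def label_step(payload):
    """Example step: attach a label to the payload."""
    return {**payload, "labels": ["harmonized"]}


def handle_event(event, pipelines, store):
    """Run every enabled pipeline whose trigger matches the event."""
    started = []
    for pipeline in pipelines:
        if not pipeline.enabled or not pipeline.trigger(event):
            continue  # inactive pipeline or no trigger match
        payload = dict(event)
        for step in pipeline.steps:  # the protocol's ordered steps
            payload = step(payload)
        store.indexed.append({"pipeline": pipeline.name, "output": payload})
        started.append(pipeline.name)
    return started


if __name__ == "__main__":
    pipelines = [
        Pipeline(
            "raw-to-ids",
            True,
            lambda e: e["path"].startswith("/instrument-a/"),
            [parse_step, label_step],
        ),
        Pipeline("disabled-example", False, lambda e: True, [parse_step]),
    ]
    store = DataStore()
    print(handle_event({"path": "/instrument-a/run1.raw"}, pipelines, store))
    print(store.indexed)
```

The sketch models each matching pipeline as producing one workflow per event, which mirrors how a single uploaded file can start workflows in several pipelines at once.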

📘
NOTE
Keep in mind the following:

Raw files that don't pass through a Tetra Data Pipeline are still searchable by using default and custom attributes, but aren't parsed.

Pipelines can run on only the latest version of a file. This behavior ensures that previous file versions don't overwrite the latest data. If a pipeline tries to process an outdated or deleted file version, the workflow fails and the TDP displays the following error message on the Workflow Details page: "message":"file is outdated or deleted so not running workflow"
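
The latest-version rule can be pictured as a guard that runs before any processing starts. The following is a minimal sketch, assuming a hypothetical `latest_versions` mapping from file ID to the current version, with deleted files absent from the mapping. The function and exception names are invented for this example; only the error message text comes from the TDP.

```python
class OutdatedFileError(Exception):
    """Hypothetical error raised when a file version must not be processed."""


ERROR_MESSAGE = "file is outdated or deleted so not running workflow"


def ensure_latest_version(file_id, version, latest_versions):
    """Raise OutdatedFileError unless `version` is the latest for `file_id`.

    `latest_versions` is an assumed mapping of file ID to latest version;
    a deleted file is modeled as a missing key.
    """
    latest = latest_versions.get(file_id)
    if latest is None or version != latest:
        raise OutdatedFileError(ERROR_MESSAGE)
    return True


if __name__ == "__main__":
    versions = {"file-1": "v3"}
    print(ensure_latest_version("file-1", "v3", versions))
    try:
        ensure_latest_version("file-1", "v2", versions)
    except OutdatedFileError as err:
        print(err)
```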

Tetra Data Pipeline Components

Tetra Data Pipelines consist of the following components.

Pipeline Component	Description
Trigger	Triggers indicate the criteria that must be met for pipeline processing to begin. There are three types of trigger conditions: (1) Simple trigger conditions require files to meet just one condition to trigger the pipeline. For example, you can configure data files that have a specific label to trigger a pipeline. (2) Complex trigger conditions require files to meet several conditions before they trigger the pipeline. For example, you can require a file to have both a specific label and file path to trigger a pipeline. Complex trigger conditions can be combined by using standard Boolean operators (AND/OR) and can be nested. (3) Scheduled trigger conditions run pipelines at a specific, recurring time. Note: Tetraflow pipelines require scheduled trigger conditions. For more information, see Step 3: Select Trigger Conditions in Set Up and Edit Pipelines.
Protocol	Protocols define the business logic of your pipeline by specifying the steps and the functions within task scripts that run those steps. For more information, see Step 4: Select a Protocol in Set Up and Edit Pipelines. Note: You can configure your own custom protocols by using either a Python script and the `python-exec` protocol, or by creating a `protocol.yml` file and using it in a self-service Tetra Data Pipeline (SSP).
Task script	A task script defines one or more functions to handle data processing logic. TetraScience currently supports task scripts written in Python, Node.js, and C#. A typical task script function takes a file pointer and other configurations as input.
Steps	Steps describe how to run a task. Steps do the work of transforming and processing the data; they are where the business logic "lives". Each step invokes a specific function from a task script.
Workflow	A workflow is the execution of a pipeline. When an input file triggers a pipeline, a workflow is created to process the input file. A workflow typically contains multiple steps that run one or more task scripts. Note: A pipeline can lead to many workflows.
Notification Settings	Notification settings indicate whether or not to send an email to one or more recipients when a pipeline fails and/or completes successfully. For more information, see Step 4: Set Notifications in Set Up and Edit Pipelines.
Finalization Details and Settings	Finalization details and settings indicate the name of the pipeline, its description, and whether or not the pipeline is enabled (active). You can also indicate if you want to use one or more standby instances, which is an advanced feature that can help speed up data processing. For more information, see Step 2: Configure the Pipeline in Set Up and Edit Pipelines.
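
To make the difference between simple and complex trigger conditions concrete, the following sketch evaluates a nested AND/OR condition against a file's labels and path. The condition format (`and`, `or`, `label`, `path_prefix` keys) and the `matches` function are assumptions made for this example; they are not the TDP's trigger configuration syntax.

```python
def matches(condition, file_info):
    """Recursively evaluate a hypothetical nested trigger condition.

    `condition` is either a group ({"and": [...]} or {"or": [...]}) or a
    leaf ({"label": ...} or {"path_prefix": ...}).
    `file_info` is a dict with a "labels" set and a "path" string.
    """
    if "and" in condition:
        return all(matches(c, file_info) for c in condition["and"])
    if "or" in condition:
        return any(matches(c, file_info) for c in condition["or"])
    if "label" in condition:
        return condition["label"] in file_info["labels"]
    if "path_prefix" in condition:
        return file_info["path"].startswith(condition["path_prefix"])
    raise ValueError(f"Unknown condition: {condition!r}")


# Simple trigger: one condition.
SIMPLE = {"label": "plate-reader"}

# Complex trigger: label AND (one of two path prefixes), nested.
COMPLEX = {
    "and": [
        {"label": "plate-reader"},
        {"or": [{"path_prefix": "/lab-a/"}, {"path_prefix": "/lab-b/"}]},
    ]
}

if __name__ == "__main__":
    file_info = {"labels": {"plate-reader"}, "path": "/lab-b/run-42.csv"}
    print(matches(SIMPLE, file_info))
    print(matches(COMPLEX, file_info))
```

Modeling conditions as nested dictionaries keeps the evaluator short, because each group delegates to the same function for its children.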

Documentation Feedback

Do you have questions about our documentation or suggestions for how we can improve it? Start a discussion in TetraConnect Hub. For access, see Access the TetraConnect Hub.

📘
NOTE
Feedback isn't part of the official TetraScience product documentation. TetraScience doesn't warrant or make any guarantees about the feedback provided, including its accuracy, relevance, or reliability. All feedback is subject to the terms set forth in the TetraConnect Hub Community Guidelines.
