Tetra Data Pipelines

Tetra Data Pipelines let you configure a set of actions that run automatically each time new data is ingested into the Tetra Scientific Data and AI Cloud. They run automated data operations and transformations that can help you create queryable, harmonized Tetra Data, and then enrich and push that data to downstream systems.

For example, you can use Tetra Data Pipelines to do any of the following:

  • Harmonize data: Parse proprietary instrument output files into a vendor-neutral and scientifically relevant Intermediate Data Schema (IDS), while also storing the data in SQL tables.
  • Transform data: Add calculated fields to standard data fields.
  • Contextualize files: Add attributes to files to make them easier to find through search. For example, you can use Tetra Data Pipelines to programmatically add information about samples, experiment names, and laboratories.
  • Enrich files: Get information from other files within the Tetra Scientific Data and AI Cloud to augment new data.
  • Convert file formats: Convert IDS JSON files to other formats, such as ADF or CSV (see the sketch after this list).
  • Push data to third-party applications: Send data to an electronic lab notebook (ELN), laboratory information management system (LIMS), analytics application, or an AI/ML platform.
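
As one concrete illustration of the format-conversion item above, here is a minimal sketch of flattening harmonized JSON into CSV. It assumes a made-up, IDS-like document layout; real IDS schemas are defined by TetraScience and the fields below are purely illustrative.

```python
import csv
import io
import json

# Hypothetical IDS-style document; the fields are made up for illustration.
ids_json = """
{
  "samples": [
    {"id": "S-001", "measurement": {"value": 3.2, "unit": "mg/mL"}},
    {"id": "S-002", "measurement": {"value": 2.9, "unit": "mg/mL"}}
  ]
}
"""

# Flatten the nested JSON into CSV rows.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sample_id", "value", "unit"])
for sample in json.loads(ids_json)["samples"]:
    writer.writerow([sample["id"],
                     sample["measurement"]["value"],
                     sample["measurement"]["unit"]])

print(buffer.getvalue())
```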

For more information and best practices, see Data Engineering and Tooling and Automation in the TetraConnect Hub. For access, see Access the TetraConnect Hub.

📘

NOTE

To extend the capabilities of Tetra Data Pipelines and the Tetra Data Platform (TDP), you can also create your own custom self-service Tetra Data pipelines (SSPs).

Tetra Data Pipeline Architecture

The following diagram shows an example Tetra Data Pipeline workflow:

Tetra Data Pipeline architecture diagram

The diagram shows the following workflow:

  1. When an event occurs, such as a file being uploaded to a specific location, each active pipeline evaluates whether the event matches its configured trigger conditions.
  2. If the event matches the pipeline's trigger condition, the pipeline's protocol runs a predefined workflow. Protocols define the business logic of your pipeline by specifying the steps and the functions within task scripts that run those steps.
  3. When processing is finished, output files are then indexed according to a predefined IDS and stored in the Tetra Scientific Data and AI Cloud. Through this process, the data becomes easily accessible through search in the TDP user interface, TetraScience API, and SQL queries. You can also push data to third-party applications, such as an ELN, LIMS, analytics application, or an AI/ML platform.
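
Conceptually, the flow above can be pictured with the following minimal sketch. This is plain Python, not TDP platform code, and all names are illustrative assumptions.

```python
# Conceptual sketch of the lifecycle above; names are illustrative, not TDP code.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Pipeline:
    name: str
    trigger_matches: Callable[[dict], bool]   # step 1: trigger check
    run_protocol: Callable[[dict], dict]      # step 2: protocol run as a workflow

def index_and_store(output: dict) -> None:
    # Step 3: in the platform, output is indexed against an IDS and becomes
    # searchable through the UI, API, and SQL; here we just print it.
    print("indexed:", output)

def on_file_uploaded(file_metadata: dict, active_pipelines: list) -> None:
    for pipeline in active_pipelines:
        if pipeline.trigger_matches(file_metadata):
            index_and_store(pipeline.run_protocol(file_metadata))

# Example with one hypothetical pipeline
hplc_pipeline = Pipeline(
    name="hplc-harmonization",
    trigger_matches=lambda f: f.get("label") == "hplc",
    run_protocol=lambda f: {"source": f["path"], "status": "harmonized"},
)
on_file_uploaded({"label": "hplc", "path": "raw/run-1.raw"}, [hplc_pipeline])
```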

📘

NOTE

Raw files that don't pass through a Tetra Data Pipeline are still searchable by using default and custom attributes, but aren't parsed.

Tetra Data Pipeline Components

Tetra Data Pipelines consist of the following components.

Trigger
Triggers indicate the criteria a file must meet for pipeline processing to begin. There are two types of trigger conditions:

- Simple trigger conditions require files to meet just one condition to trigger the pipeline. For example, you can configure data files that have a specific label to trigger a pipeline.

- Complex trigger conditions require files to meet several conditions before they trigger the pipeline. For example, you can require a file to have both a specific label and file path to trigger a pipeline. Complex trigger conditions can be combined by using standard Boolean operators (AND/OR) and can be nested.

For more information, see Step 1: Define Trigger Conditions in Set Up and Edit Pipelines.
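
For a sense of how a complex trigger combines conditions, here is a conceptual sketch in plain Python. It is not the TDP trigger-configuration schema; the label, path values, and function name are illustrative assumptions.

```python
# Conceptual illustration only -- not the TDP trigger-configuration schema,
# just how a nested complex condition combines checks.
def matches_trigger(file_metadata: dict) -> bool:
    """Hypothetical complex trigger: a specific label AND one of two paths."""
    has_label = file_metadata.get("labels", {}).get("instrument") == "hplc-01"
    path = file_metadata.get("path", "")
    in_raw_path = path.startswith("raw/chromatography/")
    in_legacy_path = path.startswith("legacy/hplc/")
    # AND of a simple condition with a nested OR group
    return has_label and (in_raw_path or in_legacy_path)

print(matches_trigger({
    "labels": {"instrument": "hplc-01"},
    "path": "raw/chromatography/2024-05-01/run-42.raw",
}))  # True
```
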
Protocol
Protocols define the business logic of your pipeline by specifying the steps and the functions within task scripts that run those steps.

For more information, see Step 2: Select and Configure the Protocol in Set Up and Edit Pipelines.

Note: You can configure your own custom protocols for use in a self-service Tetra Data pipeline (SSP) by creating a protocol.yml file.
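
As a rough idea of what a custom protocol definition can contain, here is a hypothetical protocol.yml sketch. The field names, slug format, and values are illustrative assumptions, not the authoritative SSP schema; see the self-service pipeline documentation for the exact format.

```yaml
# Hypothetical sketch only -- field names and the slug format are illustrative
# assumptions, not the authoritative SSP protocol.yml schema.
name: example-hplc-harmonization
description: Parse raw HPLC output files into IDS JSON
steps:
  - id: parse-raw-file
    task:
      slug: example-org/hplc-parser    # hypothetical task script reference
      function: parse_raw_file         # function in the task script to run
```
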
Task script
A task script defines one or more functions that handle data processing logic. TetraScience currently supports task scripts written in Python, NodeJS, and C#. A typical task script function takes a file pointer and other configuration as input.
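
For illustration, a minimal Python task script function might look like the following sketch. The "input_file_pointer" key and the context.read_file/context.write_file helpers are assumptions for this example, not the exact SDK interface; consult the task script documentation for the actual signatures.

```python
import json

def parse_raw_file(input: dict, context) -> dict:
    """Hypothetical task script function: read the triggering raw file,
    extract a few values, and write an IDS-style JSON output.

    The "input_file_pointer" key and the context.read_file/context.write_file
    helpers are assumptions for this sketch, not the exact SDK interface.
    """
    raw_text = context.read_file(input["input_file_pointer"]).decode("utf-8")

    # Pretend the raw file is simple KEY=VALUE text
    fields = dict(
        line.split("=", 1) for line in raw_text.splitlines() if "=" in line
    )

    ids_document = {
        "sample": {"id": fields.get("SAMPLE_ID")},
        "result": {"value": fields.get("PEAK_AREA")},
    }
    return context.write_file(
        content=json.dumps(ids_document),
        file_name="harmonized.json",
    )
```
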
Steps
Steps describe how to run a task. Steps do the work of transforming and processing the data; they are where the business logic "lives". Each step invokes a specific function from a task script.

Workflow
A workflow is the execution of a pipeline. When an input file triggers a pipeline, a workflow is created to process the input file. A workflow typically contains multiple steps that run one or more task scripts.

Note: A pipeline can lead to many workflows.

Notification Settings
Notification settings indicate whether to send an email to one or more recipients when a pipeline fails, completes successfully, or both.

For more information, see Step 3: Set Notifications in Set Up and Edit Pipelines.

Finalization Details and Settings
Finalization details and settings indicate the pipeline's name, description, and whether the pipeline is enabled (active). You can also choose to use one or more standby instances, an advanced feature that can help speed up data processing.

For more information, see Step 4: Finalize the Details in Set Up and Edit Pipelines.

What’s Next

Now that you have an overview of pipeline processing and know the terminology, you are ready to set up or edit a pipeline, monitor pipeline processing, and review the output.