Tetra Data Pipelines handle TetraScience's automated data operations and transformations. A pipeline configures a set of actions that run automatically each time new data is ingested into the data lake. Pipelines are used to:
- Harmonize data from RAW to the standardized IDS format. The most commonly used pipelines parse and standardize raw data from instruments, cloud repositories, and other laboratory data sources.
- Extract, decorate, and validate files. For example, you might want to extract "compound id" from the file path and add "compound id" as metadata to the RAW file.
- Convert IDS JSON to other formats, such as ADF or CSV.
- Merge data from different files, such as merging a plate concentration file from a liquid handler with a plate reader file to create an IC50 curve.
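The extract-and-decorate case in the list above can be sketched in a few lines of Python. This is a minimal illustration only, not the TetraScience task-script API: the path layout, the `compound-<id>` naming pattern, and the function names `extract_compound_id` and `decorate` are all assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical convention: the compound id appears in the path as "compound-<id>".
COMPOUND_ID_PATTERN = re.compile(r"compound-([A-Za-z0-9]+)")


def extract_compound_id(file_path: str) -> Optional[str]:
    """Return the compound id embedded in a file path, or None if absent."""
    match = COMPOUND_ID_PATTERN.search(file_path)
    return match.group(1) if match else None


def decorate(file_path: str, metadata: dict) -> dict:
    """Return a copy of the metadata with "compound id" added when one is found."""
    decorated = dict(metadata)  # copy so the caller's metadata is not mutated
    compound_id = extract_compound_id(file_path)
    if compound_id is not None:
        decorated["compound id"] = compound_id
    return decorated
```

A real pipeline would attach the resulting metadata to the RAW file through the platform; here the metadata is modeled as a plain dictionary to keep the sketch self-contained.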
TetraScience also supports pipelines for:
- Data transformation - Adding a calculated field to the standard data fields
- Data transportation - Uploading results to an external Electronic Lab Notebook (ELN) or LIMS
A pipeline consists of a trigger, a protocol, notification details, and finalization details and settings. Each of these is defined in the terminology section at the end of this topic.
When an event occurs, such as a file being ingested into a specific location in the data lake, each active data pipeline determines whether that event matches its own trigger conditions. (Trigger conditions are defined by the person who sets up or edits the pipeline.)
If the event matches the trigger condition, the protocol starts the execution of a workflow. A protocol contains configuration instructions and one or more steps, which are defined in task scripts. Task scripts contain the code for the business logic needed to process the data.
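The matching flow described above can be modeled roughly as follows. This is a simplified sketch under stated assumptions, not the platform's implementation: the `FileEvent` and `Pipeline` classes, their fields, and `run_matching_pipelines` are hypothetical names, and each step is reduced to a plain function that receives the event.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class FileEvent:
    """A newly ingested file (hypothetical shape)."""
    path: str
    category: str  # for example "RAW" or "IDS"


@dataclass
class Pipeline:
    """Hypothetical model of a pipeline: a trigger plus an ordered list of steps."""
    name: str
    enabled: bool
    trigger: Callable[[FileEvent], bool]
    steps: List[Callable[[FileEvent], str]]


def run_matching_pipelines(
    event: FileEvent, pipelines: List[Pipeline]
) -> List[Tuple[str, List[str]]]:
    """Start one workflow per active pipeline whose trigger matches the event."""
    workflows = []
    for pipeline in pipelines:
        # Inactive pipelines and non-matching triggers produce no workflow.
        if not pipeline.enabled or not pipeline.trigger(event):
            continue
        results = [step(event) for step in pipeline.steps]
        workflows.append((pipeline.name, results))
    return workflows


# Example: process any RAW file that lands in a specific directory.
plate_reader = Pipeline(
    name="parse-plate-reader",
    enabled=True,
    trigger=lambda e: e.category == "RAW" and e.path.startswith("/plate-reader/"),
    steps=[lambda e: f"parsed {e.path}"],
)
```

Calling `run_matching_pipelines(FileEvent("/plate-reader/a.csv", "RAW"), [plate_reader])` yields one workflow for the matching pipeline, while a file in any other directory yields none.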
When processing is finished, output files are indexed according to a predefined schema and stored in the data lake, where they can be easily searched and filtered in the TDP or another tool like Tableau or TIBCO Spotfire. Files can also be sent to a data target, which is simply a place where the processed output files are sent. Common data targets are an external ELN or LIMS.
RAW files that do not pass through a Data Pipeline are not parsed, but they are still searchable using Default and Custom Metadata fields.
Pipeline terminology appears in the table below.
Basic Pipeline Terminology
| Term | Definition |
| --- | --- |
| Pipeline | Package of programmed instructions that indicates the actions to take when new data that meets the trigger conditions is ingested into the data lake. You can think of a pipeline as a recipe or a plan for execution. A pipeline consists of four parts: 1) trigger, 2) protocol, 3) notification instructions, and 4) details and settings such as the name of the pipeline, its description, metadata, and tags. |
| Trigger | The criteria a newly ingested data file must satisfy in order to run the actions. Every time there is a new file or file version in the data lake, the Data Pipelines check whether the new file matches any trigger conditions. Admins can configure the trigger directly through the TetraScience user interface by navigating to the Pipeline page via the main dropdown menu and defining a set of file conditions. For example, a trigger statement could indicate that any RAW files in a specific directory will be processed using the pipeline's protocol. |
| Protocol | Steps and configurations that are used to process data. The protocol consists of two files: protocol.json, which contains the steps and configuration information, and a script.js file, which orchestrates the workflow. In a sense, the protocol is the "heart" of the pipeline. |
| Steps | Describe how to run a task. Steps do the work of transforming and processing the data; they are where the business logic "lives". Steps are defined in task scripts. |
| Workflow | Execution of a pipeline. When an input file triggers a pipeline, a workflow is created to process the input file. A workflow typically contains multiple steps that execute one or more task scripts. Note that a single pipeline can lead to many workflows. |
| Notification | Indicates whether to send an email to one or more recipients when a pipeline fails, completes successfully, or both. |
| Finalization Details and Settings | Allows you to provide the name of the pipeline, its description, and whether the pipeline is enabled (active) or not. You can also indicate whether to use one or more standby instances, which is an advanced feature that can speed processing. |
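The notification behavior defined above (email on failure, on success, or both) can be illustrated with a small sketch. The `NotificationSettings` class, its field names, and its default values are assumptions for this example and do not reflect the actual pipeline configuration format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NotificationSettings:
    """Hypothetical notification settings for one pipeline."""
    recipients: List[str] = field(default_factory=list)
    on_failure: bool = True
    on_success: bool = False


def recipients_to_notify(
    settings: NotificationSettings, workflow_succeeded: bool
) -> List[str]:
    """Return the addresses to email for a finished workflow, or an empty list."""
    wanted = settings.on_success if workflow_succeeded else settings.on_failure
    return list(settings.recipients) if wanted else []
```

With these assumed defaults, a failed workflow notifies every configured recipient and a successful one notifies nobody unless `on_success` is set.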
Now that you have an overview of pipeline processing and know the terminology, you are ready to set up or edit a pipeline, monitor pipeline processing, and review the output.