SSPs with Multistep Protocols: Decorate, Harmonize, Enrich, and Push Data Programmatically
You can create self-service Tetra Data pipelines (SSPs) that have protocols with multiple steps. Based on one or more trigger conditions, a multistep protocol SSP configuration allows you to do all of the following programmatically:
- Decorate files: Add labels to files to make them easier to find through search. For example, you can use SSPs to programmatically add information about samples, experiment names, and laboratories.
- Harmonize files: Parse proprietary instrument output files into a vendor-neutral and scientifically relevant Intermediate Data Schema (IDS).
- Enrich files: Get information from other files within the Tetra Scientific Data Cloud to augment new data.
- Push data to third-party applications: Send data to an electronic lab notebook (ELN), laboratory information management system (LIMS), or an AI/ML platform.
This topic provides an example setup for decorating, harmonizing, enriching, and pushing data to third-party applications by using a single protocol in a multistep pipeline.
Architecture
Multistep SSP File Journey
The following diagram shows the journey a file takes through the Tetra Data Platform (TDP) as it’s processed by a multistep SSP:
The diagram shows the following workflow:
- A RAW file is ingested.
- The RAW file triggers the multistep pipeline.
- The Decorate step adds labels to the RAW file.
- The Harmonize step parses the RAW file into IDS format and saves the IDS file to the TDP. This step also inherits the labels from the previous step.
- The Enrich step pulls in extra data from another file and modifies the IDS file into a PROCESSED file that has additional data stored and labels applied.
- The Push step uses Shared Settings and Secrets that are stored in the TDP to push the data from the PROCESSED file to a downstream application (Kaggle for this example setup).
Multistep SSP Workflow
The following diagram shows an example SSP workflow for decorating, harmonizing, enriching, and pushing data from the Tetra Scientific Data Cloud to a downstream AI/ML platform:
The diagram shows the following workflow:
- Required third-party API credentials are identified.
- The API credentials are stored as Shared Settings and Secrets in the TDP.
- The “Hello, World!” SSP Example sspdemo-taskscript task script version is updated to v5.0.0. The config.json file has five exposed functions:
  - print-hello-world (main.print_hello_world), which is the print_hello_world function found in the main.py file.
  - decorate-input-file (main.decorate_input_file), which is the decorate_input_file function found in the main.py file.
  - harmonize-input-file (main.harmonize_input_file), which is the harmonize_input_file function found in the main.py file.
  - enrich-input-file (main.enrich_input_file), which is the enrich_input_file function found in the main.py file.
  - push-data (main.push_data), which is the push_data function found in the main.py file that uses the third-party API credentials to push data.
- A protocol named multistep (v1.0.0) is created. The protocol.yml file provides the protocol name, description, and three configuration items named kaggle_username, kaggle_api_key, and labels_json. It also outlines the four pipeline steps, which point to the sspdemo-taskscript task script and the exposed functions.
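As a rough illustration of the slug-to-function mapping described above, the following Python sketch builds a dictionary shaped like a task script's config.json file and prints it as JSON. The language and functions keys, and the slug and function field names, are assumptions about the file layout rather than details taken from this topic, so compare them with your own task script's config.json before relying on them.

```python
import json

# Slug -> Python function path, as listed for sspdemo-taskscript v5.0.0.
# "push-data" is the slug that the example protocol's push step calls.
EXPOSED_FUNCTIONS = {
    "print-hello-world": "main.print_hello_world",
    "decorate-input-file": "main.decorate_input_file",
    "harmonize-input-file": "main.harmonize_input_file",
    "enrich-input-file": "main.enrich_input_file",
    "push-data": "main.push_data",
}


def build_config(functions):
    """Build a config.json-like dictionary.

    The key names used here ("language", "functions", "slug", "function")
    are assumed for illustration only.
    """
    return {
        "language": "python",
        "functions": [
            {"slug": slug, "function": path} for slug, path in functions.items()
        ],
    }


config_text = json.dumps(build_config(EXPOSED_FUNCTIONS), indent=2)
print(config_text)
```

Keeping the mapping in one place makes it easier to confirm that every function slug referenced by a protocol step is actually exposed by the task script.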
Create the Protocol Steps and Task Scripts
Create the required task scripts and protocols for each step that you want to include in your multistep protocol.
For instructions on how to build protocol steps for specific use cases, see the following:
- Decorate Data by Using SSPs: Add Labels to Files
- Harmonize Data by Using SSPs: Map RAW Files to IDS Files
- Enrich Data by Using SSPs: Add Extra Data to Files
- Push Data by Using SSPs: Send Data to an AI/ML Platform
Update the protocol.yml File’s Input Values to Include the Output of Each Step
In single-step protocols, the protocol’s step must correspond to only one function and one input file within a task script.
For multistep protocols, the output of one step can be used in the next step. To configure this setup, you must change the input_file_pointer value of every step except the first to the preceding step’s .output file value.
NOTE
For the first step in this example setup (decorate-input-file-step), the input_file_pointer value must remain workflow.inputFile.
Single-Step Harmonization Protocol input_file_pointer Value Example
input:
  input_file_pointer: $( workflow.inputFile )
Multistep Protocol Harmonization Step input_file_pointer Value Example
input:
  input_file_pointer: $( steps["decorate-input-file-step"].output )
NOTE
Make sure that your Python functions return the correct file pointer. To check, review the return statement of each Python function in the task script files.
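To show why the return value matters, here is a minimal sketch of how one step's returned file pointer becomes the next step's input_file_pointer. Everything platform-related in it is hypothetical: FakeContext, its write_file and read_file methods, and the fileKey pointer shape are in-memory stand-ins, not the real SDK, and the (input, context) function signature is assumed.

```python
from dataclasses import dataclass, field


@dataclass
class FakeContext:
    """Hypothetical in-memory stand-in for the platform context."""

    files: dict = field(default_factory=dict)

    def write_file(self, name, content):
        # Store the content and hand back a pointer-like dictionary.
        self.files[name] = content
        return {"fileKey": name}

    def read_file(self, pointer):
        return self.files[pointer["fileKey"]]


def decorate_input_file(input, context):
    # Labels would be applied to the RAW file here; the same pointer is passed on.
    return input["input_file_pointer"]


def harmonize_input_file(input, context):
    raw = context.read_file(input["input_file_pointer"])
    # Return the pointer of the newly written file so the next step can read it.
    return context.write_file("ids.json", {"ids": raw})


def run_steps(first_pointer, steps, context):
    """Feed each step's return value to the next step as input_file_pointer."""
    pointer = first_pointer
    for step in steps:
        pointer = step({"input_file_pointer": pointer}, context)
        if pointer is None:
            raise ValueError(f"{step.__name__} did not return a file pointer")
    return pointer


ctx = FakeContext()
raw_pointer = ctx.write_file("raw.txt", "a,b")
final_pointer = run_steps(raw_pointer, [decorate_input_file, harmonize_input_file], ctx)
print(final_pointer)
```

A function that forgets its return statement hands None to the next step, which is the failure that run_steps raises on in this sketch.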
Create and Deploy a Multistep Protocol
Protocols define the business logic of your pipeline by specifying the steps and the functions within task scripts that run those steps. For more information about how to create a protocol, see Protocol YAML Files.
For this example setup, there are four steps:
- decorate-input-file-step
- harmonize-input-file-step
- enrich-input-file-step
- push-step
All of the steps use functions that are in v5.0.0 of the sspdemo-taskscript task script.
Create a Multistep protocol.yml File
Create a multistep protocol.yml file in your code editor by using the following code snippet:
protocolSchema: "v3"
name: "Multistep - v3 protocol"
description: "Protocol that pushes data to 3rd party application."
config:
  labels_json:
    label: "Labels that can be added to file."
    description: "A json of labels that can be added to a file"
    type: "object"
    required: false
  kaggle_username:
    label: "Kaggle Username."
    description: "Kaggle Username to use for pushing data."
    type: "string"
    required: true
  kaggle_api_key:
    label: "Kaggle API Key."
    description: "Kaggle API Key to use for pushing data."
    type: "secret"
    required: true
steps:
  - id: decorate-input-file-step
    task:
      namespace: private-training-sspdemo
      slug: sspdemo-taskscript
      version: v5.0.0
      function: decorate-input-file
    input:
      input_file_pointer: $( workflow.inputFile )
      labels_json: $( config.labels_json )
  - id: harmonize-input-file-step
    task:
      namespace: private-training-sspdemo
      slug: sspdemo-taskscript
      version: v5.0.0
      function: harmonize-input-file
    input:
      input_file_pointer: $( steps["decorate-input-file-step"].output )
  - id: enrich-input-file-step
    task:
      namespace: private-training-sspdemo
      slug: sspdemo-taskscript
      version: v5.0.0
      function: enrich-input-file
    input:
      input_file_pointer: $( steps["harmonize-input-file-step"].output )
  - id: push-step
    task:
      namespace: private-training-sspdemo
      slug: sspdemo-taskscript
      version: v5.0.0
      function: push-data
    input:
      input_file_pointer: $( steps["enrich-input-file-step"].output )
      kaggle_username: $( config.kaggle_username )
      kaggle_api_key: $( config.kaggle_api_key )
NOTE
When you use a new task script version, you must reference the new version number wherever a protocol step calls that task script. This example protocol.yml file refers to v5.0.0.
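The wiring rule described in this topic (the first step reads workflow.inputFile, and each later step reads an earlier step's .output) can be checked before deployment. The following is a minimal sketch, assuming you have already extracted each step's id and input_file_pointer expression as plain strings. The check_step_wiring helper is hypothetical and is not part of any SDK.

```python
import re

# Matches expressions such as: $( steps["decorate-input-file-step"].output )
STEP_REF = re.compile(r'steps\["([^"]+)"\]\.output')


def check_step_wiring(steps):
    """Return a list of problems found in (step_id, input_file_pointer) pairs."""
    problems = []
    seen = []
    for index, (step_id, pointer) in enumerate(steps):
        match = STEP_REF.search(pointer)
        if index == 0:
            if "workflow.inputFile" not in pointer:
                problems.append(f"{step_id}: first step should read workflow.inputFile")
        elif match is None:
            problems.append(f"{step_id}: does not read an earlier step's output")
        elif match.group(1) not in seen:
            problems.append(f"{step_id}: references unknown step {match.group(1)!r}")
        seen.append(step_id)
    return problems


# The four step IDs from this example, each reading the preceding step's output.
EXAMPLE_STEPS = [
    ("decorate-input-file-step", "$( workflow.inputFile )"),
    ("harmonize-input-file-step", '$( steps["decorate-input-file-step"].output )'),
    ("enrich-input-file-step", '$( steps["harmonize-input-file-step"].output )'),
    ("push-step", '$( steps["enrich-input-file-step"].output )'),
]

print(check_step_wiring(EXAMPLE_STEPS))
```

The sketch accepts a reference to any earlier step, not only the immediately preceding one, and it flags a later step that still points at workflow.inputFile.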
Deploy the Protocol
To deploy the protocol, run the following command from your command line (for example, bash):
ts-sdk put protocol private-{TDP ORG} multistep v1.0.0 {protocol-folder} -c {auth-folder}/auth.json
NOTE
Make sure to replace {TDP ORG} with your organization slug, {protocol-folder} with the local folder that contains your protocol code, and {auth-folder} with the local folder that contains your authentication information.
NOTE
To redeploy the same version of your code, you must include the -f flag in your deployment command. This flag forces the code to overwrite the file. The following are example protocol and task script deployment commands:
ts-sdk put protocol private-xyz hello-world v1.0.0 ./protocol -f -c auth.json
ts-sdk put task-script private-xyz hello-world v1.0.0 ./task-script -f -c auth.json
For more details about the available arguments, run the following command:
ts-sdk put --help
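If you deploy from a script, the placeholders in the deployment commands in this section can be filled in programmatically. The following sketch only assembles the argument list and prints it; it does not run ts-sdk. The build_deploy_command helper and its parameter names are hypothetical, and the only flags it uses are the -f and -c flags shown in the commands above.

```python
import shlex


def build_deploy_command(org_slug, protocol_slug, version, protocol_folder,
                         auth_folder, force=False):
    """Assemble a 'ts-sdk put protocol' argument list from the placeholders."""
    args = [
        "ts-sdk", "put", "protocol",
        f"private-{org_slug}",   # replaces private-{TDP ORG}
        protocol_slug,
        version,
        protocol_folder,         # replaces {protocol-folder}
    ]
    if force:
        # -f overwrites an already deployed copy of the same version.
        args.append("-f")
    args += ["-c", f"{auth_folder}/auth.json"]  # replaces {auth-folder}
    return args


command = build_deploy_command("xyz", "multistep", "v1.0.0", "./protocol", "./auth")
print(shlex.join(command))
```

Building the command as a list, rather than as one formatted string, means it can be passed to subprocess.run without shell quoting problems if a folder path contains spaces.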
Create a Pipeline That Uses the Deployed Protocol
To use your new protocol on the TDP, create a new pipeline that uses the protocol that you deployed. Then, upload a file that matches the pipeline’s trigger conditions.
For the configuration elements in the UI configuration of your pipeline, you can use your shared setting (kaggle_username) and shared secret (kaggle_api_key).
(Optional) Add Labels to Files
You can also use the following JSON code snippet to add labels:
[
  {
    "name": "test_label_name1",
    "value": "test_value1"
  },
  {
    "name": "test_label_name2",
    "value": "test_value2"
  }
]
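Before pasting labels into the pipeline configuration, it can help to confirm that the JSON parses and follows the same shape as the snippet above. The following sketch does that with the standard library only. The rules it enforces (a JSON array whose entries each have exactly a non-empty string name and value) are assumptions inferred from the example, not documented platform requirements, and parse_labels is a hypothetical helper.

```python
import json


def parse_labels(labels_text):
    """Parse label JSON and check it against the shape used in the example."""
    labels = json.loads(labels_text)
    if not isinstance(labels, list):
        raise ValueError("labels must be a JSON array")
    for entry in labels:
        if not isinstance(entry, dict) or set(entry) != {"name", "value"}:
            raise ValueError(f"each label needs exactly 'name' and 'value': {entry!r}")
        for key in ("name", "value"):
            if not isinstance(entry[key], str) or not entry[key]:
                raise ValueError(f"label {key} must be a non-empty string: {entry!r}")
    return labels


example = """
[
  {"name": "test_label_name1", "value": "test_value1"},
  {"name": "test_label_name2", "value": "test_value2"}
]
"""
labels = parse_labels(example)
print([label["name"] for label in labels])
```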
IMPORTANT
Make sure that you add a file to the TDP for the enrich-input-file-step. If you don’t add a file for the enrich-input-file-step to use when enriching existing files, the step fails. For more information, see Enrich Data by Using SSPs: Add Extra Data to Files.