SSPs with Multistep Protocols: Decorate, Harmonize, Enrich, and Push Data Programmatically

You can create self-service Tetra Data pipelines (SSPs) that have protocols with multiple steps. Based on one or more trigger conditions, a multistep protocol SSP configuration allows you to do all of the following programmatically:

  • Decorate files: Add labels to files to improve how discoverable they are by search. For example, you can use SSPs to programmatically add information about samples, experiment names, and laboratories.
  • Harmonize files: Parse proprietary instrument output files into a vendor-neutral and scientifically relevant Intermediate Data Schema (IDS).
  • Enrich files: Get information from other files within the Tetra Scientific Data Cloud to augment new data.
  • Push data to third-party applications: Send data to an electronic lab notebook (ELN), laboratory information management system (LIMS), or an AI/ML platform.

This topic provides an example setup for decorating, harmonizing, enriching, and pushing data to third-party applications by using a single protocol in a multistep pipeline.

Architecture

Multistep SSP File Journey

The following diagram shows the journey a file takes through the Tetra Data Platform (TDP) as it’s processed by a multistep SSP:

Multistep SSP file journey diagram

Multistep SSP file journey example

The diagram shows the following workflow:

  1. A RAW file is ingested.
  2. The RAW file triggers the multistep pipeline.
  3. The Decorate step adds labels to the RAW file.
  4. The Harmonize step parses the RAW file into IDS format and saves the IDS file to the TDP. This step also inherits the labels from the previous step.
  5. The Enrich step pulls in extra data from another file and transforms the IDS file into a PROCESSED file that stores the additional data and carries the applied labels.
  6. The Push step uses Shared Settings and Secrets that are stored in the TDP to push the data from the PROCESSED file to a downstream application (Kaggle for this example setup).

Multistep SSP Workflow

The following diagram shows an example SSP workflow for decorating, harmonizing, enriching, and pushing data from the Tetra Scientific Data Cloud to a downstream AI/ML platform:

Multistep SSP workflow example diagram

Multistep SSP workflow example

The diagram shows the following workflow:

  1. Required third-party API credentials are identified.
  2. The API credentials are stored as Shared Settings and Secrets in the TDP.
  3. The sspdemo-taskscript task script from the “Hello, World!” SSP Example is updated to v5.0.0. Its config.json file has five exposed functions (sketched after this list):
  • print-hello-world (main.print_hello_world), which is the print_hello_world function found in the main.py file.
  • decorate-input-file (main.decorate_input_file), which is the decorate_input_file function found in the main.py file.
  • harmonize-input-file (main.harmonize_input_file), which is the harmonize_input_file function found in the main.py file.
  • enrich-input-file (main.enrich_input_file), which is the enrich_input_file function found in the main.py file.
  • push-data (main.push_data), which is the push_data function found in the main.py file that uses the third-party API credentials to push data.
  4. A protocol named multistep (v1.0.0) is created. The protocol.yml file provides the protocol name, description, and three configuration items named kaggle_username, kaggle_api_key, and labels_json. It also outlines the four pipeline steps, which point to the sspdemo-taskscript task script and its exposed functions.
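
The contents of config.json and main.py aren't shown in this topic, but the following is a minimal, hypothetical sketch of how main.py might be laid out to expose those five functions. The function names and inputs come from the list above; the (input, context) signature follows the standard task script pattern, and the bodies are placeholders rather than the actual example code.

# main.py (hypothetical sketch, not the actual example code)

def print_hello_world(input: dict, context):
    # Exposed as print-hello-world; carried over from the "Hello, World!" SSP Example.
    print("Hello, World!")

def decorate_input_file(input: dict, context):
    # Exposed as decorate-input-file: adds the labels passed in through labels_json
    # and returns a file pointer for the next step. (Body omitted.)
    ...

def harmonize_input_file(input: dict, context):
    # Exposed as harmonize-input-file: parses the RAW file into IDS format and
    # returns the pointer to the new IDS file. (Body omitted.)
    ...

def enrich_input_file(input: dict, context):
    # Exposed as enrich-input-file: pulls extra data from another TDP file and
    # returns the pointer to the resulting PROCESSED file. (Body omitted.)
    ...

def push_data(input: dict, context):
    # Exposed as push-data: uses kaggle_username and kaggle_api_key to push the
    # PROCESSED data to Kaggle. (Body omitted.)
    ...

The mapping between each exposed function name and its Python function is declared in the task script's config.json file.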

Create the Protocol Steps and Task Scripts

Create the required task scripts and protocols for each step that you want to include in your multistep protocol.

For instructions on how to build protocol steps for specific use cases, see the step-specific SSP topics, such as Enrich Data by Using SSPs: Add Extra Data to Files.

Update the protocol.yml File’s Input Values to Include the Output of Each Step

In a single-step protocol, the protocol's single step corresponds to one function and one input file within a task script.

For multistep protocols, the output of one step can be used as the input to the next step. To configure this setup, you must change the input_file_pointer value of every step except the first to the preceding step's .output value.

📘

NOTE

For the first step in this example setup (decorate-input-file-step), the input_file_pointer value must remain $( workflow.inputFile ).

Single-Step Harmonization Protocol input_file_pointer Value Example

input:
      input_file_pointer: $( workflow.inputFile )

Multistep Protocol Harmonization Step input_file_pointer Value Example

input:
      input_file_pointer: $( steps["decorate-input-file-step"].output )

📘

NOTE

Make sure that your Python functions return the correct file pointer in their return statements. You can verify this by opening the task script files and checking the return values of the Python functions.
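
For example, in a harmonize function like the hypothetical sketch below, the value that the function returns is the file pointer that $( steps["harmonize-input-file-step"].output ) resolves to in the next step. The context.read_file and context.write_file calls and their arguments are assumptions based on the task script context API, so confirm the exact signatures against your version of the SDK.

import json

def harmonize_input_file(input: dict, context):
    # Read the RAW file passed in through input_file_pointer. (This sketch assumes
    # the context API returns the file content under a "body" key.)
    raw_file = context.read_file(input["input_file_pointer"])

    # Placeholder parsing logic; replace with your real RAW-to-IDS mapping.
    ids_content = {"raw_text": raw_file["body"].decode("utf-8")}

    # write_file returns a pointer to the new IDS file. Returning that pointer is
    # what lets $( steps["harmonize-input-file-step"].output ) resolve to this file
    # in the next protocol step.
    return context.write_file(
        content=json.dumps(ids_content),
        file_name="harmonized.json",
        file_category="IDS",
    )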

Create and Deploy a Multistep Protocol

Protocols define the business logic of your pipeline by specifying the steps and the functions within task scripts that run those steps. For more information about how to create a protocol, see Protocol YAML Files.

For this example setup, there are four steps:

  1. decorate-input-file-step
  2. harmonize-input-file-step
  3. enrich-input-file-step
  4. push-step

All of the steps use functions that are in v5.0.0 of the sspdemo-taskscript task script.

Create a Multistep protocol.yml File

Create a multistep protocol.yml file in your code editor by using the following code snippet:

protocolSchema: "v3"
name: "Multistep - v3 protocol"
description: "Protocol that pushes data to 3rd party application."

config:
  labels_json:
    label: "Labels that can be added to file."
    description: "A json of labels that can be added to a file"
    type: "object"
    required: false
  kaggle_username:
    label: "Kaggle Username."
    description: "Kaggle Username to use for pushing data."
    type: "string"
    required: true
  kaggle_api_key:
    label: "Kaggle API Key."
    description: "Kaggle API Key to use for pushing data."
    type: "secret"
    required: true

steps:
  - id: decorate-input-file-step
    task:
      namespace: private-training-sspdemo
      slug: sspdemo-taskscript
      version: v5.0.0
      function: decorate-input-file
    input:
      input_file_pointer: $( workflow.inputFile )
      labels_json: $( config.labels_json )
  - id: harmonize-input-file-step
    task:
      namespace: private-training-sspdemo
      slug: sspdemo-taskscript
      version: v5.0.0
      function: harmonize-input-file
    input:
      input_file_pointer: $( steps["decorate-input-file-step"].output )
  - id: enrich-input-file-step
    task:
      namespace: private-training-sspdemo
      slug: sspdemo-taskscript
      version: v5.0.0
      function: enrich-input-file
    input:
      input_file_pointer: $( steps["harmonize-input-file-step"].output )
  - id: push-step
    task:
      namespace: private-training-sspdemo
      slug: sspdemo-taskscript
      version: v5.0.0
      function: push-data
    input:
      input_file_pointer: $( steps["enrich-input-file-step"].output )
      kaggle_username: $( config.kaggle_username )
      kaggle_api_key: $( config.kaggle_api_key )

📘

NOTE

When using a new task script version, you must use the new version number when calling that task script in the protocol steps. This example protocol.yml file refers to v5.0.0.

Deploy the Protocol

To deploy the protocol, run the following command from your command line (for example, bash):

ts-sdk put protocol private-{TDP ORG} multistep v1.0.0 {protocol-folder} -c {auth-folder}/auth.json

📘

NOTE

Make sure to replace {TDP ORG} with your organization slug, {protocol-folder} with the local folder that contains your protocol code, and {auth-folder} with the local folder that contains your authentication information.

📘

NOTE

To redeploy the same version of your code, you must include the -f flag in your deployment command. This flag forces the code to overwrite the existing file. The following are example deployment commands:

  • ts-sdk put protocol private-xyz hello-world v1.0.0 ./protocol -f -c auth.json
  • ts-sdk put task-script private-xyz hello-world v1.0.0 ./task-script -f -c auth.json

For more details about the available arguments, run the following command:

ts-sdk put --help

Create a Pipeline That Uses the Deployed Protocol

To use your new protocol on the TDP, create a new pipeline that uses the protocol that you deployed. Then, upload a file that matches the pipeline’s trigger conditions.

For the configuration values in your pipeline's UI configuration, you can use your shared setting (kaggle_username) and shared secret (kaggle_api_key).

(Optional) Add Labels to Files

You can also use the following JSON code snippet to add labels:

[
  {
    "name": "test_label_name1",
    "value": "test_value1"
  },
  {
    "name": "test_label_name2",
    "value": "test_value2"
  }
]
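
These label objects reach the decorate step as the labels_json input that the protocol.yml file maps from $( config.labels_json ). The following hypothetical sketch shows how a decorate function might apply them; the context.add_labels call and its signature are assumptions, so check them against the context API in your version of the SDK.

def decorate_input_file(input: dict, context):
    # labels_json arrives as the list of {"name": ..., "value": ...} objects that
    # you enter in the pipeline configuration.
    labels = input.get("labels_json", [])

    # Assumed context API call that attaches the labels to the input file.
    context.add_labels(input["input_file_pointer"], labels)

    # Return the file pointer so that the next step can consume this step's output.
    return input["input_file_pointer"]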

🚧

IMPORTANT

Make sure that you add a file to the TDP for the enrich-input-file-step to use. If the step can't find a file to use when enriching existing files, it fails. For more information, see Enrich Data by Using SSPs: Add Extra Data to Files.