Decorate Data by Using SSPs: Add Labels to Files

You can create self-service Tetra Data pipelines (SSPs) to add labels to files, which makes the files more discoverable through search. For example, you can use SSPs to programmatically add information about samples, experiment names, and laboratories.

This topic provides an example setup for adding labels to files by using an SSP.

Architecture

The following diagram shows an example SSP workflow for adding labels to files:

Example SSP workflow for adding labels to files

The diagram shows the following workflow:

  1. The sspdemo-taskscript task script from the “Hello, World!” SSP Example is updated to v2.0.0. Its config.json file exposes two functions:
    • print-hello-world (main.print_hello_world), which is the print_hello_world function found in the main.py file.
    • decorate-input-file (main.decorate_input_file), which is the decorate_input_file function found in the main.py file.
  2. A protocol named decorate (v1.0.0) is created. The protocol.yml file provides the protocol name, description, and a configuration item named labels_json. It also outlines one step: decorate-input-file-step. This step points to the sspdemo-taskscript task script and the exposed function, decorate-input-file. The inputs to this function are the input file that kicked off the pipeline workflow and the labels_json configuration item.

📘

NOTE

For an example SSP folder structure, see SSP Folder Structure in the "Hello, World!" SSP Example.

Create and Deploy the Task Script

Task scripts are the building blocks of protocols, so you must build and deploy your task scripts before you can deploy a protocol that uses them.

Task scripts require the following:

  • A config.json file that exposes your Python functions so that protocols can use them.
  • A Python file (main.py in the following examples) that contains the Python functions used for file processing.
  • A requirements.txt file that either specifies any required third-party Python modules, or that is left empty if no modules are needed.

To create and deploy a task script that decorates a file with labels, do the following.

📘

NOTE

For more information about creating custom task scripts, see Task Script Files. For information about testing custom task scripts locally, see Create and Test Custom Task Scripts.

Create a config.json File

Create a config.json file in your code editor by using the following code snippet:

{
    "language": "python",
    "runtime": "python3.11",
    "functions": [
        {
            "slug": "print-hello-world",
            "function": "main.print_hello_world"
        },
        {
            "slug": "decorate-input-file",
            "function": "main.decorate_input_file"
        }
    ]
}

📘

NOTE

You can choose which Python version a task script uses by specifying the "runtime" parameter in the script's config.json file. Python versions 3.7, 3.8, 3.9, 3.10, and 3.11 are currently supported. If you don't include a "runtime" parameter, the script uses Python v3.7 by default.
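To illustrate how each slug in config.json relates to a function in main.py, the following minimal sketch parses the same configuration and splits every "function" entry into a module name and a function name. The helper name slug_to_function is hypothetical and is not part of the ts-sdk; it only mirrors the naming convention shown above.

```python
import json

# Copy of the config.json contents shown above.
CONFIG_JSON = """
{
    "language": "python",
    "runtime": "python3.11",
    "functions": [
        {"slug": "print-hello-world", "function": "main.print_hello_world"},
        {"slug": "decorate-input-file", "function": "main.decorate_input_file"}
    ]
}
"""


def slug_to_function(config_text: str) -> dict:
    """Map each exposed slug to a (module name, function name) pair.

    Hypothetical helper for illustration; not part of the ts-sdk.
    """
    config = json.loads(config_text)
    mapping = {}
    for entry in config["functions"]:
        # "main.decorate_input_file" -> ("main", "decorate_input_file")
        module_name, _, function_name = entry["function"].rpartition(".")
        if not module_name or not function_name:
            raise ValueError(f"Expected 'module.function', got {entry['function']!r}")
        mapping[entry["slug"]] = (module_name, function_name)
    return mapping


mapping = slug_to_function(CONFIG_JSON)
print(mapping["decorate-input-file"])
```

A check like this can catch a typo in a "function" entry before you deploy, although it doesn't confirm that the function actually exists in main.py.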

Create a main.py File

Create a main.py file in your code editor by using the following code snippet:

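from ts_sdk.task.__task_script_runner import Context


def print_hello_world(input: dict, context: Context) -> str:
    print("Hello World!")
    return "Hello World!"


def decorate_input_file(input: dict, context: Context) -> dict:
    print("Start 'decorate_input_file' function...")

    # The keys match the step inputs defined in protocol.yml.
    input_file_pointer = input["input_file_pointer"]
    labels_json = input["labels_json"]

    # Look up the file name so that the log shows which file is decorated.
    file_name = context.get_file_name(input_file_pointer)
    print(f"Adding labels to '{file_name}'...")

    # Attach the labels from the pipeline configuration to the input file.
    added_labels = context.add_labels(
        file=input_file_pointer,
        labels=labels_json,
    )

    print("'decorate_input_file' completed")
    # Return the same file pointer that was passed in.
    return input_file_pointer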
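Before deploying, it can help to exercise the decorate logic without a TDP connection. The following is a minimal sketch that assumes a hand-written StubContext class, which is hypothetical and is not the real ts_sdk Context. It records add_labels calls instead of contacting the TDP. The file pointer values and the way the stub derives a file name are also placeholders.

```python
class StubContext:
    """Hypothetical stand-in for the ts_sdk Context; records calls locally."""

    def __init__(self):
        self.added = []

    def get_file_name(self, file: dict) -> str:
        # Assumption for this stub only: use the last segment of the fileKey.
        return file["fileKey"].rsplit("/", 1)[-1]

    def add_labels(self, file: dict, labels: list) -> list:
        # Record the call instead of contacting the TDP.
        self.added.append((file, labels))
        return labels


def decorate_input_file(input: dict, context) -> dict:
    # Same core logic as the main.py function above, without the ts_sdk import.
    input_file_pointer = input["input_file_pointer"]
    labels_json = input["labels_json"]
    context.add_labels(file=input_file_pointer, labels=labels_json)
    return input_file_pointer


# Placeholder file pointer and labels, for illustration only.
pointer = {
    "type": "s3file",
    "bucket": "datalake",
    "fileKey": "demo-org/raw/example.txt",
    "version": "example-version-id",
}
labels = [{"name": "test_label_name1", "value": "test_value1"}]

context = StubContext()
result = decorate_input_file(
    {"input_file_pointer": pointer, "labels_json": labels}, context
)
print(result == pointer, context.added)
```

Because the stub only mimics the two calls that this example uses, it verifies the function's control flow and not the behavior of the real Context API.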

Context API

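In the Python code provided in this example setup, the Context API is made available by importing it in the main.py file (from ts_sdk.task.__task_script_runner import Context). The Context object provides the APIs that the task script needs to interact with the TDP.

This example setup uses the following Context API functions:

  • context.get_file_name, which is called with the input file pointer to look up the file's name.
  • context.add_labels, which adds the supplied labels to the file that the file pointer refers to.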

File Pointers

File pointers are dictionaries containing the file location information stored in the TDP. File pointers are used throughout task scripts as Python function inputs/outputs and as inputs/outputs to Context API functions.

In the Python code provided in this example setup, one of the inputs to the decorate_input_file function is a file pointer. After the labels are added to the file, the function returns the same file pointer.

File Pointer Dictionary Example

{
    "type": "s3file",
    "bucket": "datalake",
    "fileKey": "<AWS S3 path/to/file>",
    "version": "<AWS S3 file version ID>"
}
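When testing locally, a quick structural check on a file pointer can catch malformed inputs early. The following sketch assumes that the four keys in the example above are the ones to check. The key list and the helper name missing_file_pointer_keys are assumptions made for illustration, and real file pointers might carry additional fields.

```python
# Keys taken from the file pointer example above (an assumption, not a schema).
EXPECTED_KEYS = ("type", "bucket", "fileKey", "version")


def missing_file_pointer_keys(pointer: dict) -> list:
    """Return the expected keys that are absent from the pointer."""
    return [key for key in EXPECTED_KEYS if key not in pointer]


complete_pointer = {
    "type": "s3file",
    "bucket": "datalake",
    "fileKey": "<AWS S3 path/to/file>",
    "version": "<AWS S3 file version ID>",
}
incomplete_pointer = {"type": "s3file", "bucket": "datalake"}

print(missing_file_pointer_keys(complete_pointer))
print(missing_file_pointer_keys(incomplete_pointer))
```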

Create a Python Package

Within the task script folder that contains the config.json and main.py files, use Python Poetry to create a Python package and generate the files that are needed to deploy it to the TDP.

Poetry Command Example to Create a Python Package

poetry init
[import packages with "poetry add"]
poetry export --without-hashes --format=requirements.txt > requirements.txt

📘

NOTE

If no packages are added, this poetry export command example produces text in requirements.txt that you must delete to create an empty requirements.txt file. A requirements.txt file is required to deploy the package to the TDP.

Deploy the Task Script

To deploy the task script, run the following command from your command line (for example, bash):

ts-sdk put task-script private-{TDP ORG} sspdemo-taskscript v2.0.0 {task-script-folder} -c {auth-folder}/auth.json

📘

NOTE

Make sure to replace {TDP ORG} with your organization slug, {task-script-folder} with the local folder that contains your task script code, and {auth-folder} with the local folder that contains your authentication information.

Also, when creating a new version of a task script and deploying it to the TDP, you must increase the version number. In this example command, the version is increased to v2.0.0.
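If you script your deployments, the version increase can be computed instead of typed by hand. The following sketch assumes that versions follow the vMAJOR.MINOR.PATCH pattern used in this example (v1.0.0, v2.0.0). The bump_version helper is hypothetical and is not part of the ts-sdk.

```python
def bump_version(version: str, part: str = "major") -> str:
    """Increase one part of a 'vMAJOR.MINOR.PATCH' version string.

    Hypothetical helper; assumes exactly three numeric parts.
    """
    major, minor, patch = (int(number) for number in version.lstrip("v").split("."))
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    elif part == "patch":
        patch += 1
    else:
        raise ValueError(f"Unknown version part: {part!r}")
    return f"v{major}.{minor}.{patch}"


print(bump_version("v1.0.0"))
```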

Create and Deploy a Protocol

Protocols define the business logic of your pipeline by specifying the steps and the functions within task scripts that execute those steps. For more information about how to create a protocol, see Protocol YAML Files.

In the following example, there’s one step: decorate-input-file-step. This step uses the decorate-input-file function that’s in the sspdemo-taskscript task script.

Create a protocol.yml File

Create a protocol.yml file in your code editor by using the following code snippet:

protocolSchema: "v3"
name: "Decorate - v3 protocol"
description: "Protocol that decorates file by adding labels."

config:
  labels_json:
    label: "Labels that can be added to file."
    description: "A json of labels that can be added to a file"
    type: "object"
    required: false

steps:
  - id: decorate-input-file-step
    task:
      namespace: private-training-sspdemo
      slug: sspdemo-taskscript
      version: v2.0.0
      function: decorate-input-file
    input:
      input_file_pointer: $( workflow.inputFile )
      labels_json: $( config.labels_json )

📘

NOTE

When using a new task script version, you must use the new version number when calling that task script in the protocol step. This example protocol.yml file refers to v2.0.0.

Configuration Items

The config property in a protocol.yml file defines the configuration elements that appear in the TDP UI when you create a pipeline that uses the protocol. By using these elements, you can create sets of pipelines that are identical, except for a difference in a supplied value.

For example, you can create two pipelines that have different triggers, so that each pipeline adds a different set of labels to the files that it processes.

The configuration IDs (for example, labels_json) must be used as protocol step inputs (for example, $( config.labels_json )). The values are then extracted from the input dictionary within Python functions (for example, labels_json = input["labels_json"]).
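The path from a configuration ID to the Python input dictionary can be pictured with a small simulation. The following sketch is not how the TDP resolves expressions internally. The resolve_step_input helper, the scopes dictionary, and the file pointer values are all hypothetical and exist only to show which value ends up under which key.

```python
import re

# Matches expressions of the form "$( scope.name )" used in protocol.yml.
EXPRESSION = re.compile(r"^\$\(\s*(\w+)\.(\w+)\s*\)$")


def resolve_step_input(step_input: dict, scopes: dict) -> dict:
    """Replace each "$( scope.name )" expression with its value (illustration only)."""
    resolved = {}
    for key, expression in step_input.items():
        match = EXPRESSION.match(expression)
        if match is None:
            raise ValueError(f"Unsupported expression: {expression!r}")
        scope, name = match.groups()
        resolved[key] = scopes[scope][name]
    return resolved


# Step inputs copied from the protocol.yml example.
step_input = {
    "input_file_pointer": "$( workflow.inputFile )",
    "labels_json": "$( config.labels_json )",
}

# Placeholder values standing in for the triggering file and the pipeline configuration.
scopes = {
    "workflow": {"inputFile": {"type": "s3file", "bucket": "datalake"}},
    "config": {"labels_json": [{"name": "test_label_name1", "value": "test_value1"}]},
}

function_input = resolve_step_input(step_input, scopes)
labels_json = function_input["labels_json"]
print(labels_json)
```

The keys of function_input are the step input names from protocol.yml, which is why the Python function reads input["labels_json"] and input["input_file_pointer"].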

Deploy the Protocol

To deploy the protocol, run the following command from your command line (for example, bash):

ts-sdk put protocol private-{TDP ORG} decorate v1.0.0 {protocol-folder} -c {auth-folder}/auth.json

📘

NOTE

Make sure to replace {TDP ORG} with your organization slug, {protocol-folder} with the local folder that contains your protocol code, and {auth-folder} with the local folder that contains your authentication information.

📘

NOTE

To redeploy the same version of your code, you must include the -f flag in your deployment command. This flag forces the code to overwrite the existing version. The following are example deployment commands:

  • ts-sdk put protocol private-xyz hello-world v1.0.0 ./protocol -f -c auth.json
  • ts-sdk put task-script private-xyz hello-world v1.0.0 ./task-script -f -c auth.json

For more details about the available arguments, run the following command:

ts-sdk put --help
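If you deploy from a script, you can assemble the same commands programmatically. The following sketch only builds the argument list and doesn't run anything. It uses just the arguments and flags shown in this topic (-f and -c), and the build_put_command helper is a hypothetical name.

```python
def build_put_command(
    kind: str,
    namespace: str,
    slug: str,
    version: str,
    folder: str,
    auth_file: str,
    force: bool = False,
) -> list:
    """Build the argument list for a 'ts-sdk put' command (illustration only)."""
    if kind not in ("protocol", "task-script"):
        raise ValueError(f"Unexpected artifact kind: {kind!r}")
    command = ["ts-sdk", "put", kind, namespace, slug, version, folder]
    if force:
        # -f overwrites an already deployed version, as described in the note above.
        command.append("-f")
    command += ["-c", auth_file]
    return command


command = build_put_command(
    "protocol", "private-xyz", "hello-world", "v1.0.0", "./protocol", "auth.json", force=True
)
print(" ".join(command))
```

You could pass the resulting list to subprocess.run to execute it, provided that the ts-sdk CLI is installed and on your PATH.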

Create a Pipeline That Uses the Deployed Protocol

To use your new protocol on the TDP, create a new pipeline that uses the protocol that you deployed. Then, upload a file that matches the pipeline’s trigger conditions.

For the labels_json configuration element in your pipeline's UI configuration, you can use the following JSON code snippet to add labels:

[
  {
    "name": "test_label_name1",
    "value": "test_value1"
  },
  {
    "name": "test_label_name2",
    "value": "test_value2"
  }
]
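Before pasting labels into the pipeline configuration, you might want to confirm that the JSON has the shape shown above. The following sketch assumes that each label is an object with exactly a non-empty string "name" and a non-empty string "value". These rules are inferred from the example and might differ from what the TDP enforces. The validate_labels helper is hypothetical.

```python
import json


def validate_labels(labels_text: str) -> list:
    """Parse a labels JSON string and check its shape (rules assumed from the example)."""
    labels = json.loads(labels_text)
    if not isinstance(labels, list):
        raise ValueError("Expected a JSON array of labels")
    for label in labels:
        if not isinstance(label, dict) or set(label) != {"name", "value"}:
            raise ValueError(f"Each label needs exactly 'name' and 'value': {label!r}")
        for key in ("name", "value"):
            if not isinstance(label[key], str) or not label[key]:
                raise ValueError(f"Label {key!r} must be a non-empty string: {label!r}")
    return labels


labels = validate_labels(
    """
    [
      {"name": "test_label_name1", "value": "test_value1"},
      {"name": "test_label_name2", "value": "test_value2"}
    ]
    """
)
print(len(labels))
```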