Context API
To interact with the Tetra Data Platform from task scripts, you need to use the context object. The context object provides functions to read files, write files, update metadata, get pipeline configuration information, and more. This topic contains:
- An example of a task script entry-point function
- List of the context API properties
- List of context API functions
- More detailed information about context API functions
Task Script Entry-Point Function Example
If your task script entry-point function process_file is in the file main.py, define process_file like this:
def process_file(input: dict, context: object):
    print("Starting task")
    input_data = context.read_file(input["inputFile"])
    # your business logic ...
    print("Task completed")
You can define input in the protocol's script.js, but you don't need to define context. It is passed in automatically as the second argument.
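For local experimentation, the entry point can be exercised with a stand-in context. The following is a minimal sketch, not the platform's behavior: `FakeContext` and its in-memory file store are hypothetical, and it mimics only the `read_file` call with the default `form='body'` result described later in this topic.

```python
import json


class FakeContext:
    """Hypothetical stand-in for the real context object (illustration only)."""

    def __init__(self, files):
        # Maps a fileId to raw bytes; the real platform reads from the data lake.
        self._files = files

    def read_file(self, file, form="body"):
        # Mimics only the default form='body' result shape.
        return {"body": self._files[file["fileId"]]}


def process_file(input: dict, context: object):
    print("Starting task")
    input_data = context.read_file(input["inputFile"])
    parsed = json.loads(input_data["body"])  # placeholder for business logic
    print("Task completed")
    return parsed


fake = FakeContext({"file-1": b'{"sample": "A1"}'})
result = process_file({"inputFile": {"fileId": "file-1"}}, fake)
```

Because context arrives as an ordinary argument, the same function runs unchanged whether it receives the platform's object or a test double.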
Context API Properties
Context API provides properties that allow you to read and write files and more.
Parameter | Type | Description |
---|---|---|
context.org_slug | string | Your organization's slug. |
context.pipeline_id | uuid | Pipeline unique ID. |
context.workflow_id | uuid | Workflow unique ID. |
context.master_script_namespace | string | Namespace for the protocol script. |
context.master_script_slug | string | Slug for the protocol script (script.js). |
context.master_script_version | string | Version of the protocol script (e.g. v1.0.0). |
context.protocol_slug | string | Protocol slug. |
context.protocol_version | string | Protocol version. |
context.pipeline_config | dict | A dictionary from strings to strings, containing the configuration parameters set when configuring the pipeline that invoked this protocol. |
context.input_file | dict | File pointer that triggered the current workflow. |
context.created_at | string | Workflow's creation time. ISO timestamp string. Generated by moment().toISOString(). Indicates when the file is created. |
context.task_id | uuid | ID for the current task (step of a pipeline). |
context.task_created_at | string | Task's creation time. ISO timestamp string. Generated by moment().toISOString(). Indicates when the task is created. |
context.platform_url | string | The URL of the platform (for example, https://platform.tetrascience.com). |
context.platform_version | string | The version of the platform being used. |
context.tmp_dir | string | Path to the temporary directory, usually /tmp. If your task script needs to write a file and then call a shell command (like LibreOffice) to process it, put the file in context.tmp_dir. Other folders are not guaranteed to be writeable. |
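As a sketch of the context.tmp_dir guidance in the table above, the helper below writes a scratch file under that directory and returns its path, which could then be handed to an external command. The helper name `write_scratch_file` and the `SimpleNamespace` stand-in for the context are hypothetical.

```python
import os
import tempfile
from types import SimpleNamespace


def write_scratch_file(context, name: str, data: bytes) -> str:
    """Write data under context.tmp_dir and return the full path."""
    path = os.path.join(context.tmp_dir, name)
    with open(path, "wb") as fh:
        fh.write(data)
    return path


# Hypothetical stand-in: the real context supplies tmp_dir itself.
fake_context = SimpleNamespace(tmp_dir=tempfile.mkdtemp())
scratch_path = write_scratch_file(fake_context, "report.csv", b"a,b\n1,2\n")
```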
Context API Functions
Context API provides functions that allow you to retrieve data, write data, and modify/check data on TDP.
Retrieving Data from Data Lake
Function | Description |
---|---|
context.read_file() | Takes an input file reference, and returns either the file content, a file object, or the local path of a copy of the file. |
context.get_labels() | Get all of the labels associated with a file. |
context.get_file_name() | Retrieve filename of a file that is not downloaded locally. |
context.get_ids() | Get the Intermediate Data Schema that matches the namespace, slug, and version. |
context.search_eql() | Searches files via Elasticsearch query language. |
context.get_logger() | Returns the logger object, so you can use log(data) to log an object. |
context.get_presigned_url() | Returns a time-limited HTTPS URL that can be used to access a file. |
context.get_secret_config_value() | Retrieves the secret's value. |
context.resolve_secret() | Returns the secret value. This function is used to convert the SSM reference to the actual secret value. |
Writing Data to Data Lake
Function | Description |
---|---|
context.write_file() | Writes an output file to the data lake. |
context.write_ids() | Writes an output IDS file to the data lake. |
Modifying/Checking Data in Data Lake
Function | Description |
---|---|
context.add_labels() | Allows you to add labels to a file. |
context.delete_labels() | Delete one or more labels from a file. |
context.update_metadata_tags() | Updates file's metadata and tags. |
context.add_attributes() | Add metadata, tags, or labels to an object. |
context.validate_ids() | Checks the validity of IDS content provided in data . |
context.run_command() | Invokes remote command/action on target (agent or connector) and returns its response. |
Context API Functions - Detailed Information
context.read_file
Takes an input file reference and returns a dictionary. The form parameter controls whether the dictionary gives access to the file content, a file object, or the local path of a copy of the file.
Parameter(s)
Parameter | Type | Description |
---|---|---|
file | dict | File Pointer with keys: - bucket - fileKey - version Starting from ts-sdk v1.2.30 it is also possible to simply use one single key fileId (it will be internally resolved to full file pointer shown above). However, a fileId for an old version of a file will not work. It must be the fileId for the latest version. |
form | string (optional) | If form=body (the default), then result['body'] holds the contents of the file as a byte array. This approach cannot handle large files that don't fit in memory. If form=file_obj, then result['file_obj'] is a file-like object that can be used to access the body in a streaming manner. This object can be passed to Python libraries such as Pandas. If form=download, then result['download'] is the file name of a local file that has been downloaded from the specified data lake file. This is useful when the data needs to be processed by native code (e.g. SQLite) or an external utility program. |
Return(s)
Type | Description |
---|---|
dict | Exactly one of the keys body, file_obj, or download is present, depending on the value of the form parameter. result['body'] is a byte array that holds the contents of the file. This approach cannot handle large files that don't fit in memory. result['file_obj'] is a file-like object that can be used to access the body in a streaming manner. This object can be passed to Python libraries such as Pandas. result['download'] is a string that is the file name of a local file that has been downloaded from the specified data lake file. This is useful when the data needs to be processed by native code (e.g. SQLite) or an external utility program. The result also includes a dict that holds the custom metadata of the document and a list of the custom tags of the document. |
Examples
body
import json
f = context.read_file(input_file_pointer, form='body')
data = json.loads(f['body'])
file_ref = {
"fileId": "8e58867d-f281-4251-81ec-baeb3fdbb2f5"
}
data = context.read_file(file_ref)["body"]
file_obj
import zipfile
import pandas as pd
f = context.read_file(input_file_pointer, form='file_obj')
with zipfile.ZipFile(f['file_obj']) as zf:
    with zf.open("somefile.csv", "r") as csvf:
        df = pd.read_csv(csvf)
download
import sqlite3
import pandas as pd
f = context.read_file(input_file_pointer, form='download')
con = sqlite3.connect(f['download'])
df = pd.read_sql_query('SELECT * FROM foo', con)
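The two accepted shapes of the file parameter described above can be checked before calling context.read_file. This is a minimal sketch: the helper name `is_file_pointer` is hypothetical, and it treats all three of bucket, fileKey, and version as required for the full pointer, which is an assumption based on the parameter table.

```python
def is_file_pointer(file) -> bool:
    """Return True if file looks like one of the two documented pointer shapes."""
    if not isinstance(file, dict):
        return False
    # Short form: a single fileId key (ts-sdk v1.2.30 and later).
    if "fileId" in file:
        return True
    # Full form: bucket, fileKey, and version.
    return all(key in file for key in ("bucket", "fileKey", "version"))
```

The check cannot tell whether a fileId refers to the latest version of a file; only the platform can resolve that.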
context.write_file
Writes an output file to the data lake.
Parameters
The following are the parameters for this function.
Parameter | Type | Description |
---|---|---|
content | string, bytes object, byte stream, dict | The content to be written out. The content can only be Dict type when file_category is set to ids , otherwise it will error. In this case, IDS validation will be automatically performed. |
file_name | string | The name of the file to be written. The full S3 path is determined by the platform. As of ts-sdk v1.3.2, the name cannot navigate upward in the file path (for example, ../file_name ). |
file_category | string | File category. Can be IDS, TMP, or PROCESSED. |
ids | string (required if file_category is "IDS", otherwise optional) | If file_category is "IDS", then this parameter specifies the specific Intermediate Data Schema. The format is: namespace/slug:v1.2.3. |
custom_metadata | dict of string to string (optional) | Metadata to be appended to a file. Keys can only include letters, numbers, spaces, and the symbols +, -, or _. Values can only include letters, numbers, spaces, and the symbols +, -, _, /, ., or ,. Including metadata with a key that already exists in the file's current metadata results in a ValueError. |
custom_tags | list of strings (optional) | Tags to be appended to a file. Can only include letters, numbers, spaces, and the symbols +, -, ., /, or _. Limited to 128 characters. Including any tag that already exists in the file's current tags results in a ValueError. |
source_type | string (optional) | Can be used to overwrite the source type S3 metadata value of the resulting document. If a value is passed, it is validated against the regex /^[-a-z0-9]+$/. If no value is passed, the default logic applies: the source type value is taken from the RAW file or, if it is not found on the RAW file, set to unknown. |
labels | list of dicts (optional) | List of dictionaries where each dictionary is in form { 'name': <LABEL_NAME>, 'value': <LABEL_VALUE> } |
gzip_compress_level | int (optional) | 1 is fastest and 9 is slowest. 0 is no compression. The default is 5 (introduced in ts-sdk v2.0.0) |
Returns
Type | Description |
---|---|
dict | Returns an object that indicates the file location. Example: { "type": "s3file", "bucket": "datalake", "fileKey": "path/to/file", "fileId": "uuidForFile", "version": "versionId" } Note: version will not be available when you are running pipelines locally. |
Examples
content: string, bytes object, or byte stream
string
context.write_file(
    content='this is a string',
    file_name=file_name,
    file_category=file_category
)
bytes object
context.write_file(
    content=bytes('byte content', 'utf-8'),
    file_name=file_name,
    file_category=file_category
)
# or
context.write_file(
    content=b'byte content',
    file_name=file_name,
    file_category=file_category
)
byte stream / file object
NOTE:
If a gzipped file is larger than 100MB, the file will be uploaded as multi-part upload.
For ts-sdk versions <= v1.2.31, the file size limit is 5GB. Starting with ts-sdk v1.2.32, the file size limit has been increased to 5TB.
# zipfile.ZipFile.open accepts only mode "r" or "w"; "r" already yields bytes
with zf.open("somefile.csv", "r") as f:
    context.write_file(
        content=f,
        file_name=file_name,
        file_category=file_category
    )
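The character rules for custom_metadata and custom_tags listed in the parameters table can be checked before calling context.write_file, so that problems surface before the upload. The sketch below is an illustration, not the SDK's own validation: the helper names are hypothetical, "letters" and "numbers" are assumed to mean ASCII, and the 128-character limit is assumed to apply to each tag individually.

```python
import re

# Assumption: "letters" and "numbers" mean ASCII letters and digits.
_KEY_RE = re.compile(r"[A-Za-z0-9 +\-_]+")
_VALUE_RE = re.compile(r"[A-Za-z0-9 +\-_/.,]+")
_TAG_RE = re.compile(r"[A-Za-z0-9 +\-./_]+")
MAX_TAG_LENGTH = 128  # assumed to be a per-tag limit


def check_metadata(custom_metadata: dict) -> list:
    """Return a list of problems found in metadata keys and values."""
    problems = []
    for key, value in custom_metadata.items():
        if not _KEY_RE.fullmatch(key):
            problems.append(f"invalid key: {key!r}")
        if not _VALUE_RE.fullmatch(value):
            problems.append(f"invalid value for {key!r}: {value!r}")
    return problems


def check_tags(custom_tags: list) -> list:
    """Return a list of problems found in tags."""
    problems = []
    for tag in custom_tags:
        if len(tag) > MAX_TAG_LENGTH or not _TAG_RE.fullmatch(tag):
            problems.append(f"invalid tag: {tag!r}")
    return problems
```

An empty list from both helpers means the values satisfy the character rules; it does not guard against the ValueError raised for keys or tags that already exist on the file.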
context.write_ids
Writes an output IDS file to the data lake.
WARNING:
Notice that you can only define file_suffix; you cannot define the complete file name of the output IDS file. The file name is codified and follows a certain pattern, which includes ids_namespace and ids_slug. This means that if you switch to a different IDS, the new IDS JSON produced will be a separate file, not a new version of the old file.
Parameters
This function contains the following parameters.
Parameter | Type | Description |
---|---|---|
content_obj | dict | Dictionary that can be JSON-serialized. |
file_suffix | string | String that will be appended to the IDS file name generated from the provided information (IDS type and version). Useful for keeping names unique when a task script generates multiple IDS files of the same type. Document will have a systematic file name generated: ${idstype}${idsversion}${file_suffix}. |
ids | string | If the file_category is "IDS" , then this parameter specifies the specific Intermediate Data Schema. The format is: namespace/slug:v1.2.3. |
custom_metadata | dict of string to string (optional) | Metadata to be appended to a file. Keys can only include letters, numbers, spaces, and the symbols +, -, or _. Values can only include letters, numbers, spaces, and the symbols +, -, _, /, ., or ,. |
custom_tags | list of strings (optional) | Tags to be appended to a file. Can only include letters, numbers, spaces, and the symbols +, -, ., /, or _. Limited to 128 characters. |
source_type | string (optional) | Used to overwrite the source type S3 metadata value of the resulting document. If a value is passed, it is validated against the regex /^[-a-z0-9]+$/. If no value is passed, the default logic applies: the source type value is taken from the RAW file or, if it is not found on the RAW file, set to unknown. |
file_category | string (optional) | Defaults to IDS. Overwrites the file category. Allowed values are IDS and TMP. If no value or any other value is provided, IDS is used. |
labels | list of dicts (optional) | List of dictionaries where each dictionary is in form { 'name': <LABEL_NAME>, 'value': <LABEL_VALUE> } |
gzip_compress_level | int (optional) | 1 is fastest and 9 is slowest. 0 is no compression. The default is 5 (introduced in ts-sdk v2.0.0) |
Returns
Type | Description |
---|---|
dict | Returns an object that indicates the file location. Example: { "type": "s3file", "bucket": "datalake", "fileKey": "path/to/file", "fileId": "uuidForFile", "version": "versionId" } Note: version will not be available when you are running pipelines locally. |
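The ids parameter uses the format namespace/slug:v1.2.3. As a rough sketch, a string in that format can be split into its parts with a regular expression. The helper name `parse_ids` is hypothetical, and the characters allowed in the namespace and slug are an assumption (anything except "/" and ":").

```python
import re

# Pattern for "namespace/slug:v1.2.3". The characters permitted in the
# namespace and slug are an assumption, not taken from the SDK.
_IDS_RE = re.compile(
    r"(?P<namespace>[^/:]+)/(?P<slug>[^/:]+):(?P<version>v\d+\.\d+\.\d+)"
)


def parse_ids(ids: str) -> dict:
    """Split an ids string into namespace, slug, and version."""
    match = _IDS_RE.fullmatch(ids)
    if match is None:
        raise ValueError(f"expected namespace/slug:v1.2.3, got {ids!r}")
    return match.groupdict()
```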
context.get_file_name
Use this function to retrieve the filename of a file that is not downloaded locally.
Parameters
The following table shows the parameters for this function.
Parameter | Type | Description |
---|---|---|
file | dict | File pointer |
Returns
Type | Description |
---|---|
string | The filename of the file. Note that you do not need to download the file locally to access the file name. |
Example(s)
def function_in_main(input: dict, context):
    input_file = input["input_file_pointer"]  # passed by protocol script.js
    name = context.get_file_name(input_file)
context.get_ids
Use this function to get the Intermediate Data Schema that matches the namespace, slug, and version.
Parameters
The parameters for this function appear in the following table.
Parameter | Type | Description |
---|---|---|
namespace | string | A namespace defines a realm; only those with the appropriate permissions can use the artifacts in that realm. There are three main categories of namespace: • common • client • private |
slug | string | A slug is a unique identifier. The term is also used to denote a reference to a unique identifier (pointer). |
version | string | Version of the IDS. |
Returns
Type | Description |
---|---|
dict | Returns an IDS schema object. |
context.validate_ids
Checks the validity of IDS content provided in data. Throws an error if not valid.
Parameters
Parameter | Type | Description |
---|---|---|
data | dict | The JSON content of the IDS file. |
namespace | string | A namespace defines a realm; only those with the appropriate permissions can use the artifacts in that realm. There are three main categories of namespace: • common • client • private |
slug | string | A slug is a unique identifier. The term is also used to denote a reference to a unique identifier (pointer). |
version | string | Version of the IDS. |
Returns
Type | Description |
---|---|
boolean | Returns a boolean value indicating whether the IDS is valid (true). It throws an error if the IDS is not valid. |
Example(s)
is_ids = context.validate_ids(
    ids_data_dict,
    namespace="common",
    slug="cell-counter",
    version="v2.0.0"
)
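Because context.validate_ids throws on invalid content rather than returning False, callers that want to continue after a failure need to catch the error. The following is a sketch under assumptions: `FakeContext` is a hypothetical stand-in whose only rule is that empty content is invalid, and the broad `except Exception` is used because the exception type raised by the SDK is not stated here.

```python
class FakeContext:
    """Hypothetical stand-in; the real function validates against the IDS schema."""

    def validate_ids(self, data, namespace, slug, version):
        # Illustrative rule only: treat empty content as invalid.
        if not data:
            raise ValueError("IDS content is empty")
        return True


def ids_problem(context, data, namespace, slug, version):
    """Return None when data validates, otherwise the error message."""
    try:
        context.validate_ids(data, namespace, slug, version)
    except Exception as err:  # exception type raised by the SDK is not documented here
        return str(err)
    return None
```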
context.get_logger
Returns the logger object, which currently has one method: log(data).
Parameters
There are no parameters for this function.
Returns
Type | Description |
---|---|
object | Returns the structured logger object, which currently has one method: log(data). |
Example(s)
log an object
logger = context.get_logger()
logger.log({
    "message": "Starting the main parser",
    "level": "info"
})
# log output
# {"timestamp": "2020-04-22T20:45:02.540947", "message": "Starting the main parser", "level": "info"}
log a string, the default level is “info”
logger = context.get_logger()
logger.log("Writing IDS into the datalake")
# log output
# {"timestamp": "2020-04-22T20:45:02.971038", "message": "Writing IDS into the datalake", "level": "info"}
context.get_secret_config_value
NOTE:
Instead of using this function, we recommend as a best practice passing pipeline configs from your protocol's master script (script.js) to your task script function and then using context.resolve_secret to resolve the secret. This makes your task script function less dependent on the TS context functions (more like a pure function). See the documentation for that function.
Retrieves the secret value from a configuration element for use in your script.
Parameters
Parameter | Type | Description |
---|---|---|
secret_name | string | Secret slug |
silent_on_error | boolean | Optional. Default: True . If set to True and the secret is missing, this function will return an empty string (otherwise an error will be thrown) |
Returns
Type | Description |
---|---|
str | The value of the secret. |
context.get_presigned_url
Returns a time-limited HTTPS URL that can be used to access the file. If URL generation fails for any reason other than an invalid value for the ttl_sec parameter, None is returned.
Parameters
Parameters for this function appear in the following table.
Parameter | Type | Description |
---|---|---|
file | dict | File pointer |
ttl_sec | number (optional) | How long the URL will be valid before it expires. It is recommended that task scripts adjust this to be in line with the command TTL. Optional, defaults to 300. Must be between 0 and 900, otherwise an error is thrown. |
Returns
Type | Description |
---|---|
str | A time-limited HTTPS URL that can be used to access the file. |
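Since a failed URL generation yields None rather than an exception, a caller may want to turn that into an explicit error. The sketch below illustrates one way to do so: the wrapper name `presigned_url_or_raise` and the `FakeContext` stand-in are hypothetical, and passing ttl_sec as a keyword argument is an assumption.

```python
def presigned_url_or_raise(context, file: dict, ttl_sec: int = 300) -> str:
    """Request a presigned URL and turn a None result into an explicit error."""
    # Documented range for ttl_sec is 0 to 900 seconds.
    if not 0 <= ttl_sec <= 900:
        raise ValueError("ttl_sec must be between 0 and 900")
    url = context.get_presigned_url(file, ttl_sec=ttl_sec)
    if url is None:
        raise RuntimeError("presigned URL could not be generated")
    return url


class FakeContext:
    """Hypothetical stand-in that returns a canned URL or None."""

    def __init__(self, url):
        self._url = url

    def get_presigned_url(self, file, ttl_sec=300):
        return self._url
```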
context.run_command
Invokes remote command/action on target (agent or connector) and returns its response.
NOTE:
This function is still in an early stage and requires Tetra DataHub and Agent support. It is not an off-the-shelf function that you can use on any instrument. Please contact the TetraScience team for more information.
Parameters
Parameter | Type | Description |
---|---|---|
org_slug | string | organization slug |
target_id | string | Unique Identifier; identifies the connector or agent that will receive the command. It is recommended that task scripts receive this as a pipeline config parameter |
action | string, enum | See supported actions below. |
payload | dict | JSON object specific to the selected action |
metadata | dict (optional, default: {} ) | A dict that describes command. Method will automatically add values for workflowId, pipelineId and taskId to metadata that will be sent to command service for execution. |
ttl_sec | number (optional, default: 300 ) | It is recommended that task scripts receive this as a pipeline config parameter. Range: 300 seconds (5min) to 900 seconds (15min) |
Actions
TetraScience.Connector.gdc.HttpRequest
Sends a REST call to any network address, as long as that address is reachable from the GDC.
- Requires TDP v3.0.0 and above
- Requires DataHub and Generic Data Connector (GDC)
- No agent needed
Returns
Type | Description |
---|---|
dict | Returns a JSON response object returned by target (agent or connector) that ran the command. |
Raises
Throws an error when any of the following are missing: org_slug, target_id, action, payload, or ttl_sec.
Throws an error when the ttl_sec is less than 300 seconds or higher than 900 seconds.
Error messages include the following:
- command not created successfully
- TTL has expired
- command didn’t execute successfully
For SDK version 1.2.22 and above: If the command does not execute successfully, the exception and exception message contains the reason why the command failed. Previously, it was "command status ERROR". Now the error message is "error response from target system."
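A call to context.run_command for the HttpRequest action might be wrapped as in the sketch below, which checks the TTL range locally before sending the command. Everything outside the documented parameter names is an assumption: the wrapper name `send_http_request`, the `FakeContext` stand-in, the payload keys, the response shape, and the use of keyword arguments are all hypothetical.

```python
def send_http_request(context, target_id: str, payload: dict, ttl_sec: int = 300):
    """Invoke the GDC HttpRequest action after checking the TTL range."""
    if not 300 <= ttl_sec <= 900:
        raise ValueError("ttl_sec must be between 300 and 900 seconds")
    return context.run_command(
        org_slug=context.org_slug,
        target_id=target_id,
        action="TetraScience.Connector.gdc.HttpRequest",
        payload=payload,
        metadata={},
        ttl_sec=ttl_sec,
    )


class FakeContext:
    """Hypothetical stand-in that records calls instead of contacting a target."""

    org_slug = "demo-org"

    def __init__(self):
        self.calls = []

    def run_command(self, org_slug, target_id, action, payload, metadata=None, ttl_sec=300):
        self.calls.append({"target_id": target_id, "action": action, "ttl_sec": ttl_sec})
        return {"status": 200}  # hypothetical response shape


fake = FakeContext()
# The payload keys below are placeholders, not a documented schema.
response = send_http_request(fake, "connector-1", {"url": "http://10.0.0.5/status"}, ttl_sec=600)
```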
context.update_metadata_tags
Updates file's metadata and tags.
Parameters
Parameter | Type | Description |
---|---|---|
file | dict | File pointer |
custom_meta | dict | Updates the metadata. Passing "None" for this parameter removes the metadata entry. |
custom_tags | list | List of new tags where each tag must be a string. |
Returns
Type | Description |
---|---|
dict | Returns an object that describes the file location. |
Example Return object
{
type: 's3file',
bucket: 'bucket',
fileKey: 'fileKey',
fileId: 'fileId',
// no version locally (from fake s3)
version: 'versionId'
}
Raises
ts-sdk versions <= v1.2.31
Raises an error if the file size is over the limit of 5GB. Additionally, the function will raise an exception if no custom_meta or custom_tags are provided.
ts-sdk versions v1.2.32 onward
Raises an error if the file size is over the limit of 5TB. Additionally, if no custom_meta or custom_tags are provided, the function returns the unmodified file instead of raising an exception.
context.resolve_secret
Returns the secret value. This function is used to convert the AWS Systems Manager Parameter Store (SSM) reference to the actual secret value.
Parameters
Parameter | Type | Description |
---|---|---|
secret_ref | dict | It contains SSM parameter store path to the secret value. |
Example
If you defined a secret in protocol.json as:
{
...
"config": [{
"slug": "password",
"name": "Password",
"description": "This is a password",
"type": "secret",
"required": true
}],
...
}
Then in script.js, you will get the password reference from the pipeline config:
...
const pipelineConfig = workflow.getContext('pipelineConfig');
const { password } = pipelineConfig;
await workflow.runTask('task-1', { password });
...
Then in your task script, you can get the password reference from input["password"]. The reference is a dictionary that contains the SSM parameter store path to the secret value, like this:
{
"secret": true,
"ssm": "/development/tetrascience-demo/org-secrets/password",
"version": 1
}
def main(input, context):
    password = context.resolve_secret(input.get('password'))
    ...
Returns
Returns the secret value. If you pass in a parameter that is not a secret, it will simply return the parameter.
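The pass-through behavior for non-secret parameters means a task script can call resolve_secret on every config value without checking its type first. The sketch below imitates that behavior with a hypothetical `FakeContext` backed by an in-memory dictionary; the real function looks the value up in the SSM parameter store, and the `host` input is an illustrative placeholder.

```python
class FakeContext:
    """Hypothetical stand-in; the real function resolves the SSM reference."""

    def __init__(self, store):
        self._store = store  # maps SSM path -> secret value

    def resolve_secret(self, secret_ref):
        # References shaped like the example above are looked up;
        # anything else is returned unchanged.
        if isinstance(secret_ref, dict) and secret_ref.get("secret"):
            return self._store[secret_ref["ssm"]]
        return secret_ref


def main(input, context):
    password = context.resolve_secret(input.get("password"))
    host = context.resolve_secret(input.get("host"))  # plain string passes through
    return {"host": host, "password": password}


ssm_path = "/development/tetrascience-demo/org-secrets/password"
fake = FakeContext({ssm_path: "s3cret"})
resolved = main(
    {"password": {"secret": True, "ssm": ssm_path, "version": 1}, "host": "db.local"},
    fake,
)
```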
context.add_labels
Allows you to add labels to a file.
If the file already has a label with the same key name, then a new label with the same key name will be created. The previous label with the same name will not be overwritten.
Parameters
Parameters appear in the table below.
Parameter | Type | Description |
---|---|---|
file | dict | Contains either (1) fileId or (2) bucket, fileKey and version (optional) of a file you want to add labels to. |
labels | list of dict | List of dictionaries where each dictionary is in form { 'name': <LABEL_NAME>, 'value': <LABEL_VALUE> } |
Returns
Type | Description |
---|---|
list of dicts | Returns a list of the added labels, where each label item shows that label's id and creation time. |
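The labels parameter expects a list of dictionaries with name and value keys. As a small sketch, such a list can be built from name/value pairs. The helper name `to_labels` is hypothetical; it takes pairs rather than a dict so that the same name can appear more than once, which matches the note above that labels with the same name are not overwritten.

```python
def to_labels(pairs) -> list:
    """Convert (name, value) pairs into the list-of-dicts label format."""
    return [{"name": name, "value": value} for name, value in pairs]


labels = to_labels([("instrument", "hplc"), ("instrument", "uv-detector")])
```

The resulting list could then be passed as the labels argument of context.add_labels or context.write_file.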
context.get_labels
Use this function to get all of the labels associated with a file.
Parameters
Parameter | Type | Description |
---|---|---|
file | dict | Contains either (1) fileId or (2) bucket, fileKey and version (optional) of a file that you want view label information for. |
Returns
Type | Description |
---|---|
list of dicts | Returns a list of dicts related to the given file, where each dict contains information about a label. |
context.delete_labels
Use this function to delete one or more labels from a file. Labels are discussed in detail in the Basic Concepts: Metadata, Tags, and Labels topic.
Parameters
Parameter | Type | Description |
---|---|---|
file | dict | Can contain either (1) fileId or (2) bucket, fileKey and version (optional) of a file that we want to delete labels from. |
label_ids | List(str) | IDs of the labels you want to delete. |
Returns
Type | Description |
---|---|
object | An empty object if the label was successfully deleted from the file. |
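Because context.delete_labels works on label IDs, removing labels by name takes a lookup with context.get_labels first. The sketch below shows that combination under a loud assumption: each label dict returned by get_labels is taken to have "id" and "name" keys, which is not confirmed here. The helper name `delete_labels_named` and the in-memory `FakeContext` are hypothetical.

```python
def delete_labels_named(context, file: dict, name: str) -> list:
    """Delete every label on the file whose name matches; return the deleted IDs."""
    labels = context.get_labels(file)
    # Assumption: each label dict exposes "id" and "name" keys.
    label_ids = [label["id"] for label in labels if label["name"] == name]
    if label_ids:
        context.delete_labels(file, label_ids)
    return label_ids


class FakeContext:
    """Hypothetical stand-in that keeps labels in memory."""

    def __init__(self, labels):
        self.labels = list(labels)

    def get_labels(self, file):
        return list(self.labels)

    def delete_labels(self, file, label_ids):
        self.labels = [lb for lb in self.labels if lb["id"] not in label_ids]
        return {}
```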
context.add_attributes
This function was added in TDP v3.0.0 and ts-sdk v1.2.20. It allows you to add metadata, tags, or labels to an object.
Parameters
Parameter | Type | Description |
---|---|---|
file | dict | A dict that describes the file location. E.g. { type: 's3file', bucket: 'datalake', fileKey: 'path/to/file', fileId: 'uuidForFile', version: 'versionId'} |
custom_meta | dict | Optional. List of metadata values and keys. Passing in a value of "None" will remove all metadata from the file. |
custom_tags | list | Optional. List of tags. |
labels | list | Optional. List of dictionaries where each dictionary is in form { 'name': <LABEL_NAME>, 'value': <LABEL_VALUE> } |
Returns
Type | Description |
---|---|
dict | Returns a dict that describes the file location. A new file version will be generated if metadata/tags were provided. |
context.search_eql
This function was added in TDP v3.1.0 and ts-sdk v1.2.24. It searches files via Elasticsearch query language. For more details, see Search by Elasticsearch Query Language.
Parameters
Parameter | Type | Description |
---|---|---|
payload | dict | Elasticsearch Query Language (EQL). You can read more about the query format that Elasticsearch uses on their website |
returns (added in ts-sdk v1.2.30) | str | raw (default): returns the raw response from ES. filePointers: returns a list of file pointers that can be used in context.read_file. |
Returns
Type | Description |
---|---|
dict if returns=raw, list if returns=filePointers | Result of the ES query (raw or file pointers). |
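With returns set to filePointers, the search result can be fed straight into context.read_file. The following sketch shows that chain under stated assumptions: the helper name `read_matching_files` and the `FakeContext` stand-in are hypothetical, the stand-in ignores the query entirely, and the `category` field in the payload is a placeholder rather than a confirmed index field.

```python
def read_matching_files(context, payload: dict) -> list:
    """Run a search and read the body of every file it returns."""
    # returns="filePointers" requires ts-sdk v1.2.30 or later.
    pointers = context.search_eql(payload, returns="filePointers")
    return [context.read_file(pointer)["body"] for pointer in pointers]


class FakeContext:
    """Hypothetical stand-in backed by an in-memory file store."""

    def __init__(self, files):
        self._files = files  # maps fileId -> bytes

    def search_eql(self, payload, returns="raw"):
        # Ignores the query and returns every file; illustration only.
        return [{"fileId": file_id} for file_id in self._files]

    def read_file(self, file, form="body"):
        return {"body": self._files[file["fileId"]]}


# "category" is a placeholder field name, not a confirmed index field.
payload = {"query": {"term": {"category": "IDS"}}}
bodies = read_matching_files(FakeContext({"a": b"1", "b": b"2"}), payload)
```

Reading every match with the default form loads each file fully into memory, so for large result sets the file_obj or download forms described under context.read_file may be a better fit.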