IDS Design - athena.json

❗️

Location Changed

The content of this page has been moved to TetraScience Confluence and it's not public-facing anymore. We will revise this page once self-service IDS is available.

Platform Requirements

WARNING: athena.json is not yet validated by the IDS validator

#RuleChecked by IDS Validator
1The partition name cannot collide with any of the normalized IDS paths. See details in Partition PathNo
2The partition path cannot be a field in an array of objects. See details in Partition PathNo

TS Convention

Bare-minimum

Copy the template below into your athena.json file

{
    "root": "root",
    "partitions": []
}

Make sure you have the correct partitions defined. Follow the rest of this tutorial to learn how to define partitions.

For most common use cases, keep "root": "root"

  • If you want to name your root table differently, for example "root": "injection", the root table will be called <idsType>_<idsMajorVersionNumber>_injection

If you want to be familiar with Athena partition and how we use it, check out the Partition section in .

Partition

Partitioning can be used to improve Athena performance. Data is partitioned based on the value of the chosen field(s) given in the partitions property.

A partition is defined by an object which has a path and a name. Objects are placed into the partitions array in athena.json

{
  "path": "sample.year",
  "name": "year"
}

Multiple partitions can be created. For example, if an IDS contains values for year and month you can create two partitions.

Ordering

If you are creating more than one partition, ordering within the partitions array is important. Place the main partition first, then any sub-partitions later.

For example, if partitioning by year and month, create:

{
  "root": "root",
  "partitions": [{
    "path": "sample.year",
    "name": "year"
  }, {
    "path": "sample.month",
    "name": "month"
  }]
}

Partition Path

path defines the location of the value within an IDS file. Partitions can be created only from paths (columns) which are available in the root table. That means any property whose path is not nested inside an array.

Valid path: sample.batch_name (sample is an object)
Invalid path: samples[*].batch_name (samples is an array of objects)

Partition Name

The name defines what the partition column will be called.

🚧

Warning

Do not name the partition the same as the normalized value of the IDS path, or any other path in the IDS. You can find more details in Athena table/column/partition name normalization

Valid case: sample.batch_name (sample is an object)

{
  "path": "sample.batch_name",
  "name": "sample_batch"
}

Invalid case:

{
  "path": "sample.batch_name",
  "name": "sample_batch_name"
}

Choice of partition

When choosing a property to be the partition, consider several factors:

  • How will data be balanced across partitions
  • How many values will partition generate

A good choice for a partition is a property that will split approximately equally between partitions, that will not generate too many partitions and that will be a useful column to filter by in an SQL query.

Good choices: year, project, location
Bad choices: sample_value, created_at (because they are too unique)

Partition Examples

Example 1: No Partitioning

Partition Examples - Example 1 No Partitioning

IDS Type: example

IDS Version: v1.0.0

schema.json

{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "class": { "type": "string" },
    "results": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "value": { "type": "number" },
          "unit": { "type": "string" }
        }
      }
    }
  }
}

athena.json

{
    "root": "experiments"
}

A combination of schema.json and athena.json are used to generate a file containing the definitions of the data transformation which occurs when any IDS of this type and version arrives in the datalake. This file is called normalized.json and it contains all the information necessary to create Athena tables as well as transform incoming JSON into CSV.

s3://<athena-bucket>/<orgSlug>/example/v1/normalized.json

Whatever value you enter as the root, will be sanitized according to the following rule

Resulting Tables

table 1 name: example_v1_experiments

uuidnameclass

table 2 name: example_v1_results

uuidparent_uuidvalueunit

Data

When an IDS file arrives, CSV files are generated.

Incoming file: /folder/test/0.json

{
    "name": "Experiment #1",
    "class": "distillation",
    "results": [{
        "value": 1829,
        "unit": "s"
    }, {
        "value": 1.23,
        "unit": "l"
    }]
}

Generated CSV Files

All files will be found under s3://<athena-bucket>/<orgSlug>/

  • /example/v1/experiments/foldertest0.json.csv.gz
  • /example/v1/results/foldertest0.json.csv.gz
  • /metadata/v1/foldertest0.json.csv.gz

Query results

Query:

SELECT * FROM example_v1_experiments;

returns:

uuidnameclass
4be5d31a-587c-4974-8653-728f167df201Experiment #1distillation

Query:

SELECT * FROM example_v1_results;

returns:

uuidparent_uuidvalueunit
254b74ab-9269-4a9b-8049-9c2a9fa9722f4be5d31a-587c-4974-8653-728f167df2011829s
c3098033-fdac-4125-a7f1-8f362e67e9ff4be5d31a-587c-4974-8653-728f167df2011.23l

Example 2: With Partitioning

Partition Examples - Example 2 With Partitioning

IDS Type: example

IDS Version: v2.0.0

schema.json (same as Example 1)

{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "class": { "type": "string" },
    "results": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "value": { "type": "number" },
          "unit": { "type": "string" }
        }
      }
    }
  }
}

athena.json

{
  "root": "experiments",
  "partitions": [{
    "name": "experiment_class",
    "path": "class"
  }]
}

Resulting Tables

table 1 name: example_v2_experiments

uuidnameclassexperiment_class

table 2 name: example_v2_results

uuidparent_uuidvalueunitexperiment_class

Notice experiment_class is a virtual column with type String

Data

When an IDS file arrives, CSV files are generated.

Incoming file: /folder/test/0.json (same as Example 1)

{
    "name": "Experiment #1",
    "class": "distillation",
    "results": [{
        "value": 1829,
        "unit": "s"
    }, {
        "value": 1.23,
        "unit": "l"
    }]
}

Generated CSV Files

All files will be found under s3://<athena-bucket>/<orgSlug>/:

  • /example/v2/experiments/experiment_class=distillation/foldertest0.json.csv.gz
  • /example/v2/results/experiment_class=distillation/foldertest0.json.csv.gz
  • /metadata/v1/foldertest0.json.csv.gz

Query results

Query:

SELECT * FROM example_v2_experiments;

returns:

uuidnameclassexperiment_class
2be5d31a-587c-4974-8653-728f167df201Experiment #1distillationdistillation

Query:

SELECT * FROM example_v2_results;

returns:

uuidparent_uuidvalueunitexperiment_class
a54b74ab-9269-4a9b-8049-9c2a9fa9722f2be5d31a-587c-4974-8653-728f167df2011829sdistillation
a3098033-fdac-4125-a7f1-8f362e67e9ff2be5d31a-587c-4974-8653-728f167df2011.23ldistillation

Athena table/column/partition name normalization

Athena has restrictions on what tables/columns/partitions can be named:

  • lowercase letter and number
  • no special characters, except underscore _

Because IDS keys can potentially include uppercase and special chars we need to sanitize the name to conform to Athena spec. The transformations rules are applied:

  • to lowercase
  • replace all special chars with underscore
  • remove repeated underscores
  • remove leading underscore

Example 1:

IDS property path: _Weird-partition!@name
Athena column name: weird_partition_name

Example 2:

IDS property path: @fileId
Athena column name: fileid

Example 3:

IDS property path: project.name
Athena column partition: project_name

Athena value sanitization

Make sure you escape: line break, return character, escape characters, comma in the value

Avoid using array of string literal that contains ","

Example

["A,B", "C"] will be serialized into A,B,C