IDS Design - athena.json

Suggest Edits

❗️
Location Changed
The content of this page has been moved to TetraScience Confluence and it's not public-facing anymore. We will revise this page once self-service IDS is available.

Platform Requirements

WARNING: athena.json is not yet validated by the IDS validator

#	Rule	Checked by IDS Validator
1	The partition name cannot collide with any of the normalized IDS paths. See details in Partition Path	No
2	The partition path cannot be a field in an array of objects. See details in Partition Path	No

TS Convention

Bare-minimum

Copy the template below into your athena.json file

{
    "root": "root",
    "partitions": []
}

Make sure you have the correct partitions defined. Follow the rest of this tutorial to learn how to define partitions.

For most common use cases, keep "root": "root"

If you want to name your root table differently, for example "root": "injection", the root table will be called <idsType>_<idsMajorVersionNumber>_injection

If you want to be familiar with Athena partition and how we use it, check out the Partition section in .

Partition

Partitioning can be used to improve Athena performance. Data is partitioned based on the value of the chosen field(s) given in the partitions property.

A partition is defined by an object which has a path and a name. Objects are placed into the partitions array in athena.json

{
  "path": "sample.year",
  "name": "year"
}

Multiple partitions can be created. For example, if an IDS contains values for year and month you can create two partitions.

Ordering

If you are creating more than one partition, ordering within the partitions array is important. Place the main partition first, then any sub-partitions later.

For example, if partitioning by year and month, create:

{
  "root": "root",
  "partitions": [{
    "path": "sample.year",
    "name": "year"
  }, {
    "path": "sample.month",
    "name": "month"
  }]
}

Partition Path

path defines the location of the value within an IDS file. Partitions can be created only from paths (columns) which are available in the root table. That means any property whose path is not nested inside an array.

Valid path: sample.batch_name (sample is an object)
Invalid path: samples[*].batch_name (samples is an array of objects)

Partition Name

The name defines what the partition column will be called.

🚧
Warning
Do not name the partition the same as the normalized value of the IDS path, or any other path in the IDS. You can find more details in Athena table/column/partition name normalization

Valid case: sample.batch_name (sample is an object)

{
  "path": "sample.batch_name",
  "name": "sample_batch"
}

Invalid case:

{
  "path": "sample.batch_name",
  "name": "sample_batch_name"
}

Choice of partition

When choosing a property to be the partition, consider several factors:

How will data be balanced across partitions
How many values will partition generate

A good choice for a partition is a property that will split approximately equally between partitions, that will not generate too many partitions and that will be a useful column to filter by in an SQL query.

Good choices: year, project, location
Bad choices: sample_value, created_at (because they are too unique)

Partition Examples

Example 1: No Partitioning

Partition Examples - Example 1 No Partitioning

IDS Type: example

IDS Version: v1.0.0

schema.json

{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "class": { "type": "string" },
    "results": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "value": { "type": "number" },
          "unit": { "type": "string" }
        }
      }
    }
  }
}

athena.json

{
    "root": "experiments"
}

A combination of schema.json and athena.json are used to generate a file containing the definitions of the data transformation which occurs when any IDS of this type and version arrives in the datalake. This file is called normalized.json and it contains all the information necessary to create Athena tables as well as transform incoming JSON into CSV.

s3://<athena-bucket>/<orgSlug>/example/v1/normalized.json

Whatever value you enter as the root, will be sanitized according to the following rule

Resulting Tables

table 1 name: example_v1_experiments

uuid	name	class

table 2 name: example_v1_results

uuid	parent_uuid	value	unit

Data

When an IDS file arrives, CSV files are generated.

Incoming file: /folder/test/0.json

{
    "name": "Experiment #1",
    "class": "distillation",
    "results": [{
        "value": 1829,
        "unit": "s"
    }, {
        "value": 1.23,
        "unit": "l"
    }]
}

Generated CSV Files

All files will be found under s3://<athena-bucket>/<orgSlug>/

/example/v1/experiments/foldertest0.json.csv.gz
/example/v1/results/foldertest0.json.csv.gz
/metadata/v1/foldertest0.json.csv.gz

Query results

Query:

SELECT * FROM example_v1_experiments;

returns:

uuid	name	class
4be5d31a-587c-4974-8653-728f167df201	Experiment #1	distillation

Query:

SELECT * FROM example_v1_results;

returns:

uuid	parent_uuid	value	unit
254b74ab-9269-4a9b-8049-9c2a9fa9722f	4be5d31a-587c-4974-8653-728f167df201	1829	s
c3098033-fdac-4125-a7f1-8f362e67e9ff	4be5d31a-587c-4974-8653-728f167df201	1.23	l

Example 2: With Partitioning

Partition Examples - Example 2 With Partitioning

IDS Type: example

IDS Version: v2.0.0

schema.json (same as Example 1)

{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "class": { "type": "string" },
    "results": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "value": { "type": "number" },
          "unit": { "type": "string" }
        }
      }
    }
  }
}

athena.json

{
  "root": "experiments",
  "partitions": [{
    "name": "experiment_class",
    "path": "class"
  }]
}

Resulting Tables

table 1 name: example_v2_experiments

uuid	name	class	experiment_class

table 2 name: example_v2_results

uuid	parent_uuid	value	unit	experiment_class

Notice experiment_class is a virtual column with type String

Data

When an IDS file arrives, CSV files are generated.

Incoming file: /folder/test/0.json (same as Example 1)

{
    "name": "Experiment #1",
    "class": "distillation",
    "results": [{
        "value": 1829,
        "unit": "s"
    }, {
        "value": 1.23,
        "unit": "l"
    }]
}

Generated CSV Files

All files will be found under s3://<athena-bucket>/<orgSlug>/:

/example/v2/experiments/experiment_class=distillation/foldertest0.json.csv.gz
/example/v2/results/experiment_class=distillation/foldertest0.json.csv.gz
/metadata/v1/foldertest0.json.csv.gz

Query results

Query:

SELECT * FROM example_v2_experiments;

returns:

uuid	name	class	experiment_class
2be5d31a-587c-4974-8653-728f167df201	Experiment #1	distillation	distillation

Query:

SELECT * FROM example_v2_results;

returns:

uuid	parent_uuid	value	unit	experiment_class
a54b74ab-9269-4a9b-8049-9c2a9fa9722f	2be5d31a-587c-4974-8653-728f167df201	1829	s	distillation
a3098033-fdac-4125-a7f1-8f362e67e9ff	2be5d31a-587c-4974-8653-728f167df201	1.23	l	distillation

Athena table/column/partition name normalization

Athena has restrictions on what tables/columns/partitions can be named:

lowercase letter and number
no special characters, except underscore _

Because IDS keys can potentially include uppercase and special chars we need to sanitize the name to conform to Athena spec. The transformations rules are applied:

to lowercase
replace all special chars with underscore
remove repeated underscores
remove leading underscore

Example 1:

IDS property path: _Weird-partition!@name
Athena column name: weird_partition_name

Example 2:

IDS property path: @fileId
Athena column name: fileid

Example 3:

IDS property path: project.name
Athena column partition: project_name

Athena value sanitization

Make sure you escape: line break, return character, escape characters, comma in the value

Avoid using array of string literal that contains ","

Example

["A,B", "C"] will be serialized into A,B,C

Updated over 1 year ago

IDS Design - athena.json

❗️
Location Changed

Platform Requirements

TS Convention

Bare-minimum

Partition

Ordering

Partition Path

Partition Name

🚧
Warning

Choice of partition

Partition Examples

Resulting Tables

Data

Generated CSV Files

Query results

Resulting Tables

Data

Generated CSV Files

Query results

Athena table/column/partition name normalization

Athena value sanitization

❗️Location Changed

Platform Requirements

TS Convention

Bare-minimum

Partition

Ordering

Partition Path

Partition Name

🚧Warning

Choice of partition

Partition Examples

Resulting Tables

Data

Generated CSV Files

Query results

Resulting Tables

Data

Generated CSV Files

Query results

Athena table/column/partition name normalization

Athena value sanitization

❗️
Location Changed

🚧
Warning