IDS Design - athena.json
Location Changed
The content of this page has been moved to TetraScience Confluence and it's not public-facing anymore. We will revise this page once self-service IDS is available.
Platform Requirements
WARNING: athena.json is not yet validated by the IDS validator
# | Rule | Checked by IDS Validator |
---|---|---|
1 | The partition name cannot collide with any of the normalized IDS paths. See details in Partition Path | No |
2 | The partition path cannot be a field in an array of objects. See details in Partition Path | No |
TS Convention
Bare-minimum
Copy the template below into your athena.json file
{
"root": "root",
"partitions": []
}
Make sure you have the correct partitions defined. Follow the rest of this tutorial to learn how to define partitions.
For most common use cases, keep "root": "root"
- If you want to name your root table differently, for example
"root": "injection"
, the root table will be called<idsType>_<idsMajorVersionNumber>_injection
If you want to be familiar with Athena partition and how we use it, check out the Partition section in .
Partition
Partitioning can be used to improve Athena performance. Data is partitioned based on the value of the chosen field(s) given in the partitions property.
A partition is defined by an object which has a path and a name. Objects are placed into the partitions
array in athena.json
{
"path": "sample.year",
"name": "year"
}
Multiple partitions can be created. For example, if an IDS contains values for year and month you can create two partitions.
Ordering
If you are creating more than one partition, ordering within the partitions
array is important. Place the main partition first, then any sub-partitions later.
For example, if partitioning by year and month, create:
{
"root": "root",
"partitions": [{
"path": "sample.year",
"name": "year"
}, {
"path": "sample.month",
"name": "month"
}]
}
Partition Path
path
defines the location of the value within an IDS file. Partitions can be created only from paths (columns) which are available in the root table. That means any property whose path is not nested inside an array.
Valid path: sample.batch_name
(sample
is an object)
Invalid path: samples[*].batch_name
(samples
is an array of objects)
Partition Name
The name defines what the partition column will be called.
Warning
Do not name the partition the same as the normalized value of the IDS path, or any other path in the IDS. You can find more details in Athena table/column/partition name normalization
Valid case: sample.batch_name
(sample
is an object)
{
"path": "sample.batch_name",
"name": "sample_batch"
}
Invalid case:
{
"path": "sample.batch_name",
"name": "sample_batch_name"
}
Choice of partition
When choosing a property to be the partition, consider several factors:
- How will data be balanced across partitions
- How many values will partition generate
A good choice for a partition is a property that will split approximately equally between partitions, that will not generate too many partitions and that will be a useful column to filter by in an SQL query.
Good choices: year, project, location
Bad choices: sample_value, created_at (because they are too unique)
Partition Examples
Example 1: No Partitioning
Partition Examples - Example 1 No Partitioning
IDS Type: example
IDS Version: v1.0.0
schema.json
{
"type": "object",
"properties": {
"name": { "type": "string" },
"class": { "type": "string" },
"results": {
"type": "array",
"items": {
"type": "object",
"properties": {
"value": { "type": "number" },
"unit": { "type": "string" }
}
}
}
}
}
athena.json
{
"root": "experiments"
}
A combination of schema.json and athena.json are used to generate a file containing the definitions of the data transformation which occurs when any IDS of this type and version arrives in the datalake. This file is called normalized.json and it contains all the information necessary to create Athena tables as well as transform incoming JSON into CSV.
s3://<athena-bucket>/<orgSlug>/example/v1/normalized.json
Whatever value you enter as the root, will be sanitized according to the following rule
Resulting Tables
table 1 name: example_v1_experiments
uuid | name | class |
---|---|---|
table 2 name: example_v1_results
uuid | parent_uuid | value | unit |
---|---|---|---|
Data
When an IDS file arrives, CSV files are generated.
Incoming file: /folder/test/0.json
{
"name": "Experiment #1",
"class": "distillation",
"results": [{
"value": 1829,
"unit": "s"
}, {
"value": 1.23,
"unit": "l"
}]
}
Generated CSV Files
All files will be found under s3://<athena-bucket>/<orgSlug>/
- /example/v1/experiments/foldertest0.json.csv.gz
- /example/v1/results/foldertest0.json.csv.gz
- /metadata/v1/foldertest0.json.csv.gz
Query results
Query:
SELECT * FROM example_v1_experiments;
returns:
uuid | name | class |
---|---|---|
4be5d31a-587c-4974-8653-728f167df201 | Experiment #1 | distillation |
Query:
SELECT * FROM example_v1_results;
returns:
uuid | parent_uuid | value | unit |
---|---|---|---|
254b74ab-9269-4a9b-8049-9c2a9fa9722f | 4be5d31a-587c-4974-8653-728f167df201 | 1829 | s |
c3098033-fdac-4125-a7f1-8f362e67e9ff | 4be5d31a-587c-4974-8653-728f167df201 | 1.23 | l |
Example 2: With Partitioning
Partition Examples - Example 2 With Partitioning
IDS Type: example
IDS Version: v2.0.0
schema.json (same as Example 1)
{
"type": "object",
"properties": {
"name": { "type": "string" },
"class": { "type": "string" },
"results": {
"type": "array",
"items": {
"type": "object",
"properties": {
"value": { "type": "number" },
"unit": { "type": "string" }
}
}
}
}
}
athena.json
{
"root": "experiments",
"partitions": [{
"name": "experiment_class",
"path": "class"
}]
}
Resulting Tables
table 1 name: example_v2_experiments
uuid | name | class | experiment_class |
---|
table 2 name: example_v2_results
uuid | parent_uuid | value | unit | experiment_class |
---|
Notice experiment_class
is a virtual column with type String
Data
When an IDS file arrives, CSV files are generated.
Incoming file: /folder/test/0.json
(same as Example 1)
{
"name": "Experiment #1",
"class": "distillation",
"results": [{
"value": 1829,
"unit": "s"
}, {
"value": 1.23,
"unit": "l"
}]
}
Generated CSV Files
All files will be found under s3://<athena-bucket>/<orgSlug>/
:
- /example/v2/experiments/experiment_class=distillation/foldertest0.json.csv.gz
- /example/v2/results/experiment_class=distillation/foldertest0.json.csv.gz
- /metadata/v1/foldertest0.json.csv.gz
Query results
Query:
SELECT * FROM example_v2_experiments;
returns:
uuid | name | class | experiment_class |
---|---|---|---|
2be5d31a-587c-4974-8653-728f167df201 | Experiment #1 | distillation | distillation |
Query:
SELECT * FROM example_v2_results;
returns:
uuid | parent_uuid | value | unit | experiment_class |
---|---|---|---|---|
a54b74ab-9269-4a9b-8049-9c2a9fa9722f | 2be5d31a-587c-4974-8653-728f167df201 | 1829 | s | distillation |
a3098033-fdac-4125-a7f1-8f362e67e9ff | 2be5d31a-587c-4974-8653-728f167df201 | 1.23 | l | distillation |
Athena table/column/partition name normalization
Athena has restrictions on what tables/columns/partitions can be named:
- lowercase letter and number
- no special characters, except underscore
_
Because IDS keys can potentially include uppercase and special chars we need to sanitize the name to conform to Athena spec. The transformations rules are applied:
- to lowercase
- replace all special chars with underscore
- remove repeated underscores
- remove leading underscore
Example 1:
IDS property path: _Weird-partition!@name
Athena column name: weird_partition_name
Example 2:
IDS property path: @fileId
Athena column name: fileid
Example 3:
IDS property path: project.name
Athena column partition: project_name
Athena value sanitization
Make sure you escape: line break, return character, escape characters, comma in the value
Avoid using array of string literal that contains ","
Example
["A,B", "C"]
will be serialized into A,B,C
Updated about 1 year ago