What is the Tetra Data Lake?

The cloud-based Tetra Data Lake is an important component of the Tetra Data Platform (TDP). It contains:

  • RAW data
  • Standardized data in the Intermediate Data Schema (IDS)
  • Search index
  • Graph representation of the data
  • (sometimes) Data in tabular formats to facilitate SQL queries

It is built on top of Amazon Simple Storage Service (Amazon S3) and uses Elasticsearch to index all the files. It also exposes a flexible web API, and Amazon Athena provides JDBC and ODBC connections to your data.

TetraScience chose Amazon S3 to leverage its:

  • Availability
  • Cost
  • Compatibility with many popular big data and machine learning frameworks (such as Hadoop and Spark)

For example, Apache Hadoop ships with a connector to Amazon S3 called "S3A" (more details), and Spark provides integrations with Amazon S3 (more details).
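
As a rough illustration of how the S3A connector is addressed, the sketch below rewrites an `s3://` URI to the `s3a://` scheme that Hadoop and Spark use for this connector. The bucket name, prefix, and the helper `to_s3a_uri` are hypothetical and are not part of the Tetra Data Platform. The Spark call in the trailing comment assumes a Spark session with the `hadoop-aws` module available.

```python
from urllib.parse import urlparse


def to_s3a_uri(s3_uri: str) -> str:
    """Rewrite an s3:// (or s3a://) URI to the s3a:// scheme used by Hadoop's S3A connector."""
    parsed = urlparse(s3_uri)
    if parsed.scheme not in ("s3", "s3a"):
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    if not parsed.netloc:
        raise ValueError(f"missing bucket name: {s3_uri!r}")
    # netloc is the bucket name; path is the object key or prefix.
    return f"s3a://{parsed.netloc}{parsed.path}"


# Hypothetical bucket and prefix, for illustration only.
raw_path = to_s3a_uri("s3://example-datalake-bucket/raw/instrument-a/")
print(raw_path)

# With PySpark and the hadoop-aws module on the classpath, the files under
# this prefix could then be read with, for example:
#   df = spark.read.text(raw_path)
```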

When choosing a storage service, we selected Amazon S3 over HDFS for Hadoop based on these important considerations:

  • Cost - In terms of storage costs, Amazon S3 is five times less expensive than HDFS. Based on our experience managing petabytes of data, the human cost of operating Amazon S3 is virtually zero, whereas maintaining HDFS may take a team of Hadoop engineers or vendor support. Once this additional human cost is factored in, Amazon S3 is ten times less expensive than HDFS clusters on EC2 with comparable capacity.
  • Elasticity - The main benefit of Amazon S3 (cloud storage) is its elasticity and pay-as-you-go pricing model, in which you are charged only for the data you put in. If you need to add more data, you just add it; the cloud provider automatically provisions resources on demand. Amazon S3 is elastic, whereas HDFS is not.
  • Availability and Durability - With automatic data replication across multiple data centers within different AWS Availability Zones, Amazon S3’s availability and durability are best in class.
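
The elasticity point can be made concrete with a small worked sketch. It compares a pay-as-you-go model, which charges for the data stored each month, with a simplified fixed-capacity model that must be sized for peak usage from the start. The monthly volumes and the unit price are hypothetical, and the same unit price is used for both models so that only the effect of elasticity shows. The figures are not actual Amazon S3 or HDFS prices and do not reproduce the cost ratios quoted above.

```python
def pay_as_you_go_cost(stored_tb_per_month, price_per_tb_month):
    """Charge each month only for the data stored in that month."""
    return sum(tb * price_per_tb_month for tb in stored_tb_per_month)


def provisioned_cost(stored_tb_per_month, price_per_tb_month):
    """Charge every month for capacity sized to the peak month (simplified model)."""
    peak_tb = max(stored_tb_per_month)
    return peak_tb * price_per_tb_month * len(stored_tb_per_month)


# Hypothetical growth in stored data over four months, in TB.
usage_tb = [10, 20, 40, 80]
# Hypothetical price: 1 arbitrary cost unit per TB per month.
unit_price = 1

elastic = pay_as_you_go_cost(usage_tb, unit_price)   # 10 + 20 + 40 + 80 = 150
fixed = provisioned_cost(usage_tb, unit_price)       # 80 * 4 = 320

print(f"pay-as-you-go: {elastic} units, fixed capacity: {fixed} units")
```

Under these assumptions, the fixed-capacity model pays for unused space in every month before the peak, which is the gap the pay-as-you-go model avoids.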

For additional details about these comparisons, see: