Cloud-based Data Lake Overview

The cloud-based Data Lake is an important component of the Tetra Data Platform (TDP). It contains:

  • RAW data
  • Standardized data in the Intermediate Data Schema (IDS)
  • Search index
  • Graph representation of the data
  • (sometimes) Data in tabular formats to facilitate SQL queries

It is built on top of AWS S3 and uses Elasticsearch to index all files. It also provides a flexible Web API, and AWS Athena provides JDBC and ODBC connections to your data.
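As an illustration of SQL access through Athena, the sketch below uses the AWS SDK for Python (boto3) instead of a JDBC or ODBC driver. It only shows the general shape of an Athena query request. The database name, table name, and results bucket are hypothetical placeholders, not actual TDP names, and the helper functions are invented for this example.

```python
def build_athena_request(sql, database, output_location):
    """Assemble keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query and return its execution ID.

    Requires boto3 and valid AWS credentials; not called in this sketch.
    """
    import boto3  # imported lazily so the builder above works without it

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical names: replace with the database, table, and bucket you actually use.
request = build_athena_request(
    sql="SELECT * FROM example_table LIMIT 10",
    database="example_datalake_db",
    output_location="s3://example-bucket/athena-results/",
)
```

Calling `run_query(request)` would submit the query. Athena runs queries asynchronously, so the results would then be fetched separately once the execution completes.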

TetraScience chose AWS S3 to leverage its:

  • Availability
  • Cost
  • Compatibility with many popular big data and machine learning frameworks (such as Hadoop and Spark).

For example, Apache Hadoop ships with a connector to S3 called "S3A" (more details), and Spark provides integrations with AWS S3 (more details).
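To make the S3A connector concrete, here is a minimal sketch of how a Spark job might address objects in S3 through the `s3a://` scheme. The bucket name, object prefix, and helper functions are hypothetical, and the read step assumes that PySpark and the `hadoop-aws` module are available with AWS credentials configured in the environment.

```python
def s3a_uri(bucket, key):
    """Build an s3a:// URI of the form the S3A connector expects."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


def read_json_from_s3(bucket, prefix):
    """Load JSON files under an S3 prefix into a Spark DataFrame.

    Requires pyspark plus the hadoop-aws module; not called in this sketch.
    """
    from pyspark.sql import SparkSession  # imported lazily: optional dependency

    spark = SparkSession.builder.appName("datalake-s3a-sketch").getOrCreate()
    return spark.read.json(s3a_uri(bucket, prefix))


# Hypothetical bucket and prefix, for illustration only.
uri = s3a_uri("example-datalake-bucket", "/ids/example-run/")
print(uri)  # s3a://example-datalake-bucket/ids/example-run/
```

Spark addresses S3 objects by URI in the same way it addresses HDFS paths, so existing jobs typically need only a different path scheme.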

When determining which storage service to use, we chose to use S3 instead of HDFS for Hadoop based on these important considerations:

  • Cost - In terms of storage costs, S3 is five times less expensive than HDFS. Based on our experience managing petabytes of data, S3’s human cost is virtually zero, whereas it may take a team of Hadoop engineers or vendor support to maintain HDFS. After factoring in this additional human cost, TetraScience found S3 to be ten times less expensive than HDFS clusters on EC2 with comparable capacity.
  • Elasticity - The main benefit of S3 (cloud storage) is its elasticity and pay-as-you-go pricing model, in which you are charged only for the data you store. If you need to add more data, you just add it; the cloud provider automatically provisions resources on demand. S3 is elastic; HDFS is not.
  • SLA (Availability and Durability) - S3 automatically replicates data across Availability Zones (separate data centers), so its availability and durability are far superior to those of HDFS.
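The elasticity point can be illustrated with a small arithmetic sketch. It compares pay-as-you-go billing, where each month is charged for the data actually stored, with a fixed cluster sized in advance for peak usage. All figures are hypothetical and chosen only to show the shape of the comparison. The same unit price is used for both models to isolate the effect of elasticity, so this does not reproduce the cost ratios quoted above.

```python
def pay_as_you_go_cost(stored_tb_by_month, price_per_tb_month):
    """Charge each month only for the data actually stored that month."""
    return [tb * price_per_tb_month for tb in stored_tb_by_month]


def provisioned_cost(stored_tb_by_month, price_per_tb_month):
    """Charge every month for capacity sized to the peak month."""
    peak_tb = max(stored_tb_by_month)
    return [peak_tb * price_per_tb_month for _ in stored_tb_by_month]


# Hypothetical growth in stored data (TB) over four months.
usage_tb = [10, 20, 40, 80]
# Hypothetical unit price, identical for both models.
unit_price = 1.0

elastic = pay_as_you_go_cost(usage_tb, unit_price)
fixed = provisioned_cost(usage_tb, unit_price)

print(sum(elastic), sum(fixed))  # 150.0 320.0
```

In this toy scenario the fixed cluster pays for 80 TB from the first month, even though that capacity is only needed in the last one.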

For additional details about these comparisons, see: