One of the key components of the Tetra Data Platform is its cloud-based Data Lake. It holds the raw data, standardized data in Intermediate Data Schema (IDS), a search index, a graph representation of the data, and in some cases data in tabular formats (to facilitate SQL queries).
It is built on top of AWS S3 and AWS Neptune, uses Elasticsearch to index all the files, and exposes a flexible Web API. AWS Athena provides JDBC and ODBC connections to your data.
We made this choice to take advantage of S3's availability and low cost, as well as its compatibility with many popular big data and machine learning frameworks, such as Hadoop and Spark.
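As a rough illustration of SQL access through Athena, the sketch below submits a query and polls for the result. It uses the Athena API through boto3 rather than the JDBC/ODBC drivers mentioned above, and it assumes your AWS credentials can reach the Athena instance behind your deployment. The database, table, and bucket names are hypothetical placeholders, not names defined by the platform.

```python
import time


def build_athena_request(sql, database, output_location):
    """Build keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql, database, output_location, poll_seconds=1.0):
    """Run a query through the Athena API and return the raw result rows."""
    import boto3  # imported lazily so the helper above works without AWS access

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(
        **build_athena_request(sql, database, output_location)
    )["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]


# Hypothetical names: replace with the database, table, and bucket you use.
request = build_athena_request(
    sql="SELECT * FROM example_table LIMIT 10",
    database="example_database",
    output_location="s3://example-bucket/athena-results/",
)
```

Calling `run_query(**request)` would execute the query against AWS; `build_athena_request` on its own only assembles the parameters and needs no network access.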
There has been much discussion about using S3 instead of HDFS for Hadoop. Here are some of the most important considerations:
In terms of storage cost alone, S3 is 5X cheaper than HDFS. Based on our experience managing petabytes of data, S3’s human cost is virtually zero, whereas it usually takes a team of Hadoop engineers or vendor support to maintain HDFS. Once we factor in human cost, S3 is 10X cheaper than HDFS clusters on EC2 with comparable capacity.
One of the biggest benefits of S3, and of cloud storage in general, is its elasticity and pay-as-you-go pricing model: you are charged only for what you store, and if you need to store more data, you simply upload it. Under the hood, the cloud provider automatically provisions resources on demand. Simply put, S3 is elastic; HDFS is not.
Because S3 automatically replicates data across Availability Zones (separate data centers), its availability and durability are far superior to those of HDFS.
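To make the cost comparison above concrete, the following sketch combines the two quoted ratios (5X on storage alone, 10X once human cost is included) to work out what they imply about the cost of operating HDFS. It treats S3's human cost as zero, following the "virtually zero" statement, and normalizes S3 storage cost to 1 unit. The function name and the normalization are illustrative choices, not figures from the platform.

```python
def implied_hdfs_ops_cost(storage_ratio, total_ratio, s3_ops_cost=0.0):
    """Return the HDFS operations cost implied by the two cost ratios.

    All values are in units where the S3 storage cost equals 1.
    """
    s3_storage = 1.0
    hdfs_storage = storage_ratio * s3_storage  # HDFS storage is storage_ratio times S3
    s3_total = s3_storage + s3_ops_cost        # S3 total: storage plus (near-zero) human cost
    hdfs_total = total_ratio * s3_total        # HDFS total is total_ratio times S3 total
    return hdfs_total - hdfs_storage           # the remainder is HDFS human/operations cost


ops = implied_hdfs_ops_cost(storage_ratio=5, total_ratio=10)
print(ops)  # 5.0
```

Under these assumptions, the human cost of running HDFS comes out roughly equal to its storage cost, which is how a 5X storage gap becomes a 10X total gap.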
You can read more about these comparisons here