Disaster Recovery for Tetra Hosted Multi-Tenant Deployments

To preserve data and restore service after a catastrophic event that renders a Tetra hosted multi-tenant production site inoperable for an extended period, TetraScience creates disaster recovery (DR) environments by default.

A DR environment provides the infrastructure required to resume operations on a secondary site (the DR site) after a disaster has severely impaired the main production site.

Regional Redundancy

Each Tetra hosted environment runs in a specific AWS Region, but is highly redundant because each Region has multiple, isolated locations known as Availability Zones (AZ). Because of this underlying infrastructure, the platform will continue to operate as normal if a platform component in one AZ goes down, or if an entire AZ fails.

For more information, see AWS Global Infrastructure in the AWS documentation.

Data Replication

Each Tetra hosted environment is also configured in a second AWS Region in a different geography (DR Region) for Region-wide disaster recovery purposes. All data within the TDP environment, including all user files and platform state, is constantly replicated to the DR Region.

All artifacts (Docker images, configuration files, etc.) required for running the TDP are replicated to all supported Regions, for all generally available platform versions.

📘

NOTE

For deployments in the European Union (EU), data is not replicated in AWS Regions outside of the EU. For US deployments, data is not replicated in Regions outside of the United States.
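
The residency rule in this note can be expressed as a simple check on a pair of Regions. The following is a minimal sketch, not TetraScience's implementation: the `JURISDICTION_BY_REGION` mapping and the `is_valid_dr_pair` function are hypothetical names, and the Regions listed are only examples of real AWS Region codes, not a list of Regions that the TDP supports.

```python
# Illustrative mapping only (assumption): NOT the list of Regions supported by the TDP.
JURISDICTION_BY_REGION = {
    "us-east-1": "US",     # N. Virginia
    "us-east-2": "US",     # Ohio
    "us-west-1": "US",     # N. California
    "us-west-2": "US",     # Oregon
    "eu-west-1": "EU",     # Ireland
    "eu-central-1": "EU",  # Frankfurt
    "eu-west-3": "EU",     # Paris
    "eu-north-1": "EU",    # Stockholm
}


def is_valid_dr_pair(primary_region: str, dr_region: str) -> bool:
    """Return True if the DR Region differs from the primary Region
    but stays within the same jurisdiction (EU or US)."""
    primary = JURISDICTION_BY_REGION.get(primary_region)
    dr = JURISDICTION_BY_REGION.get(dr_region)
    if primary is None or dr is None:
        return False  # unknown Regions are rejected rather than guessed
    return primary_region != dr_region and primary == dr


print(is_valid_dr_pair("eu-west-1", "eu-central-1"))
print(is_valid_dr_pair("eu-west-1", "us-east-1"))
```

An explicit mapping is used instead of matching on the `eu-` or `us-` prefix because a Region code prefix does not by itself establish jurisdiction.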

Recovery Procedure

The TDP installation process is fully automated through Infrastructure as Code (IaC). For all stateless components, the recovery procedure performed in the DR Region is similar to a new TDP installation. However, instead of starting with an empty configuration, the recovery procedure uses the replicated data that is available in the DR Region.

Recovery Time Objective

Recovery time objective (RTO) is the maximum acceptable time between DR initiation and the resumption of operations.

The RTO for Tetra hosted environments is 12 hours.

Recovery Point Objective

Recovery point objective (RPO) is the maximum acceptable data loss, measured as the length of time before the moment of disaster for which data might not be recoverable.

Because different techniques are used to replicate different data types, the RPO varies by data type. The standard RPO time frames for platform components are listed in the following table.

Standard RPO Time Frame for TDP Components

| Component | RPO Value |
| --- | --- |
| Raw and processed files in the TDP | 15 minutes |
| Configurations (for example, pipeline settings, user permissions, and event history) | 12 hours |
| File indexing (which supports search functionality) | 6 hours |
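
To make the RPO values concrete, the following sketch computes, for a given disaster time, the point before which each component's data should be recoverable. It is an illustration only: the dictionary keys, the function name `worst_case_recovery_points`, and the example timestamp are assumptions, while the durations are taken from the table above.

```python
from datetime import datetime, timedelta, timezone

# RPO durations from the table above; the keys are hypothetical labels.
RPO_BY_COMPONENT = {
    "files": timedelta(minutes=15),
    "configurations": timedelta(hours=12),
    "file_indexing": timedelta(hours=6),
}


def worst_case_recovery_points(disaster_time: datetime) -> dict[str, datetime]:
    """For each component, return the time after which changes
    might be lost if the disaster occurs at disaster_time."""
    return {name: disaster_time - rpo for name, rpo in RPO_BY_COMPONENT.items()}


# Hypothetical disaster time, used only for illustration.
disaster = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
for component, point in worst_case_recovery_points(disaster).items():
    print(f"{component}: changes made after {point.isoformat()} might be lost")
```

In this example, a disaster at 12:00 UTC could cost up to the last 15 minutes of file data, but up to the last 12 hours of configuration changes.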

Disaster Recovery Testing

A disaster recovery test is performed for each TDP release that contains infrastructure-related changes, or once a year, whichever comes first.
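
The "whichever comes first" rule can be read as two independent triggers. The sketch below shows one way to express it; the function name `dr_test_due` and the approximation of a year as 365 days are assumptions made for illustration, not part of the TDP release process.

```python
from datetime import date, timedelta

# Assumption: "yearly" is approximated as 365 days.
YEARLY_INTERVAL = timedelta(days=365)


def dr_test_due(last_test: date, today: date, release_has_infra_changes: bool) -> bool:
    """Return True when either trigger has fired: the release contains
    infrastructure-related changes, or a year has passed since the last test."""
    return release_has_infra_changes or (today - last_test) >= YEARLY_INTERVAL


# Hypothetical dates: two months after the last test, no infrastructure changes.
print(dr_test_due(date(2024, 1, 1), date(2024, 3, 1), release_has_infra_changes=False))
```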

The disaster recovery test consists of recovering a TDP environment in the DR Region from the replicated data, and then performing data validation with the data in the Tetra hosted production environment.
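
The source does not describe how the data validation is carried out. One common approach, shown here purely as an assumed sketch, is to compare checksum manifests from the production environment and the recovered environment. The function names, file paths, and file contents below are all hypothetical.

```python
import hashlib


def sha256_hex(content: bytes) -> str:
    """Return the SHA-256 digest of content as a hex string."""
    return hashlib.sha256(content).hexdigest()


def compare_manifests(
    production: dict[str, str], recovered: dict[str, str]
) -> dict[str, list[str]]:
    """Compare {file_path: checksum} manifests and report production files
    that are missing from, or differ in, the recovered environment."""
    missing = sorted(p for p in production if p not in recovered)
    mismatched = sorted(
        p for p in production if p in recovered and production[p] != recovered[p]
    )
    return {"missing": missing, "mismatched": mismatched}


# Hypothetical manifests: one file matches, one differs.
production = {"raw/a.csv": sha256_hex(b"1,2,3"), "raw/b.csv": sha256_hex(b"4,5,6")}
recovered = {"raw/a.csv": sha256_hex(b"1,2,3"), "raw/b.csv": sha256_hex(b"4,5,X")}
report = compare_manifests(production, recovered)
print(report)
```

Because production keeps running during the test, files written within the RPO window might legitimately be absent from the recovered environment, so a real comparison would need to account for that window.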

📘

NOTE

The Tetra hosted production environment continues to run normally and is not affected by the disaster recovery test in any way.