Tetra Data Platform Health Monitoring

You can use the Health Monitoring dashboards to assess the health and performance of the Tetra Data Platform (TDP) components.

Access and View the Overall TDP Health Monitoring Dashboard

The overall TDP Health Monitoring Dashboard (the Dashboard tab) provides an end-to-end snapshot of the components' health for the entire TDP ecosystem. Additionally, you can view performance data and quick health visualizations for each of these TDP components:

  • Tetra Agents
  • Data Hubs and Data Hub Connectors
  • Cloud Connectors
  • Pipelines
  • Files

To access and view the overall TDP Health Monitoring Dashboard:

  1. Log in to TDP using an Administrator user account.
  2. Click the profile icon at the top right of the page, and then select Health Monitoring.
Health Monitoring option

The Health Monitoring screen opens with the Dashboard tab selected by default. If it is not selected, click Dashboard to view an end-to-end snapshot of the components' health for the entire TDP ecosystem.

Overall Health Monitoring Dashboard

Each dashboard is divided into two sections:

  • The top half of the dashboard displays a statistical visualization of how many instances of a component are in a healthy, unhealthy, or critical state.
Statistical Visualization Example

Possible States

Definition

Example

Healthy

Indicates that the component is operating optimally within specified parameters. What is considered optimal differs by component.

A healthy pipeline has a run time that is less than one standard deviation from the average run time for the last five runs. By contrast, a healthy cloud connector makes a connection within three attempts.

Unhealthy

Indicates that the component is not operating optimally, but has not failed. What is considered optimal differs by component.

An unhealthy Data Hub has a memory usage value that is greater than 80% but less than or equal to 90% for the past five contiguous minutes.

Critical

Indicates that the component has failed or is well outside the specified parameters. As with the Healthy and Unhealthy states, the exact conditions that contribute to the Critical state differ by component.

If the percentage of used disk space for a Tetra Agent is greater than 90%, then that component is in a critical state.

  • The bottom half of the dashboard displays a detailed list of component instances that you can filter by type. Additionally, some dashboards provide historical status data.
Component Instances Details List
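Conceptually, the top half of each dashboard tallies instances by state, and the bottom half is a list you can narrow by component type. The Python sketch below illustrates that idea only; it is not TDP code, and `HealthState`, `ComponentInstance`, `tally_states`, and `filter_by_type` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class HealthState(Enum):
    """The three states that every Health Monitoring dashboard reports."""
    HEALTHY = "Healthy"
    UNHEALTHY = "Unhealthy"
    CRITICAL = "Critical"


@dataclass
class ComponentInstance:
    """Hypothetical record for one component instance (not a TDP data model)."""
    uid: str
    name: str
    component_type: str  # for example "Agent", "Data Hub", or "Pipeline"
    state: HealthState


def tally_states(instances: List[ComponentInstance]) -> Dict[HealthState, int]:
    """Count instances per state for the top-half visualization; empty states get 0."""
    counts = Counter(instance.state for instance in instances)
    return {state: counts[state] for state in HealthState}


def filter_by_type(instances: List[ComponentInstance], component_type: str) -> List[ComponentInstance]:
    """Return the bottom-half detail rows for a single component type."""
    return [instance for instance in instances if instance.component_type == component_type]
```

For example, tallying two healthy agents and one critical pipeline yields counts of 2, 0, and 1 for Healthy, Unhealthy, and Critical.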

File Statistics

This section provides an aggregate count of:

  • Files Uploaded: Number of files that were uploaded by all members of the organization.
  • Workflows Triggered: Number of workflows that were launched.
  • Files Indexed: Number of files that were indexed.
  • Files in Athena: Number of files that are in Athena; you can use SQL to query Athena files.
  • Files Failed Indexing: Number of files that have failed to index.

Overall File Health

This section indicates the overall file health status: Healthy, Unhealthy, or Critical.

Problems

This section is an aggregate list of all components with critical issues. To review components that are not in the Critical state, view the dashboard for the specific TDP component. You can filter problems by these types: All, Agents, Data Hubs, Data Hub Connectors, Cloud Connectors, and Pipelines. The list of problems includes the following information for each component:

Field

Description

Health

Status of the component. On the overall TDP Health Monitoring Dashboard, only Critical issues display.

Name

Name of the component instance that is currently in the critical state.

To view additional details about the component (such as CPU, Memory, and Disk Usage), hover over the name. The information that displays depends on the component type. For example, connectors show memory usage and the last time that a status was received (last contact), whereas Tetra Agents show CPU, Memory, and Disk Usage statistics and last contact.

To copy the unique ID for the component instance, you can click the copy file icon.

Health Description

Explains why the component has been assigned the critical state. By default, only one issue is shown. If there is more than one issue, then a link displays (for example, +1 More) indicating there are additional issues to review.

Link

Provides a link you can click to review the configuration details for the TDP component.
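The Problems list behaves like a filter over all component instances: only Critical entries are kept, an optional type filter narrows them further, and the Health Description shows the first issue with a "+N More" link for the rest. The following is a minimal Python sketch of that behavior, assuming a hypothetical `Problem` record; none of these names, or the sample issue text in any usage, come from the TDP.

```python
from dataclasses import dataclass, field
from typing import List

# Type filters listed on the overall dashboard; "All" disables type filtering.
PROBLEM_TYPES = ["All", "Agents", "Data Hubs", "Data Hub Connectors", "Cloud Connectors", "Pipelines"]


@dataclass
class Problem:
    """Hypothetical row for one component instance."""
    name: str
    component_type: str  # one of PROBLEM_TYPES other than "All"
    state: str  # "Healthy", "Unhealthy", or "Critical"
    issues: List[str] = field(default_factory=list)


def problems_view(rows: List[Problem], type_filter: str = "All") -> List[Problem]:
    """Keep only Critical rows, then apply the optional type filter."""
    if type_filter not in PROBLEM_TYPES:
        raise ValueError(f"unknown type filter: {type_filter}")
    return [
        row
        for row in rows
        if row.state == "Critical" and (type_filter == "All" or row.component_type == type_filter)
    ]


def health_description(issues: List[str]) -> str:
    """Show the first issue; summarize any remaining issues as '+N More'."""
    if not issues:
        return ""
    remaining = len(issues) - 1
    return issues[0] if remaining == 0 else f"{issues[0]} +{remaining} More"
```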

View the Tetra Agents Health Dashboard

The Tetra Agents Health Dashboard provides statistics for these Windows-based Agents:

  • Tetra Chromeleon Agent
  • Tetra Empower Agent
  • Tetra File-Log Agent
  • Tetra LabX Agent
  • Tetra UNICORN Agent

Tetra Agent Details

To learn more about each of these Tetra Agents, click here.

From the Health Monitoring screen, click the Agents tab to view the Tetra Agents Health Dashboard:

Tetra Agents Dashboard

The aggregate status and numbers of Healthy, Unhealthy, and Critical Tetra Agents display as a graphic at the top of the screen. A Tetra Agent may exist in these possible states:

State

Event

Healthy

A Tetra Agent is in a Healthy state when:

  • Online: A status from the Tetra Agent was received within the past 5 minutes and/or a file was received in the past 40 minutes.
  • File Transmission: Files are being transmitted; a status from the Tetra Agent was received within the past 5 minutes and/or a file was received in the past 40 minutes.
  • Environment: Percentage of disk space used is less than or equal to 80%, the percentage of memory used is less than or equal to 80%, and/or CPU usage is less than or equal to 80%.

Unhealthy

A Tetra Agent is in an Unhealthy state when:

  • Online: A status from the Tetra Agent was not received within the past 5 minutes but a file was received in the past 40 minutes.
  • File Transmission: The Tetra Agent has sent only an intermittent status for more than 3 but less than 5 minutes, and a file has not been received in the past 20 minutes.
  • Environment: Percentage of disk space used is greater than 80% but less than or equal to 90%, the percentage of memory used is greater than 80% but less than or equal to 90%, and/or CPU usage is greater than 80% but less than or equal to 90%.

Critical

A Tetra Agent is in a Critical state when:

  • Online: A status from the Tetra Agent was not received within the past 5 minutes and a file was not received in the past 40 minutes.
  • File Transmission: The upload error rate is greater than 70% of all upload events, and the scan access rate is greater than 70% within the past hour. The scan access rate is the ability of the Tetra Agent to access a particular folder or drive.
  • Environment: Percentage of disk space used is greater than 90%, the percentage of memory used is greater than 90%, and/or CPU usage is greater than 90%.

To find specific Tetra Agents:

  • To search for a Tetra Agent, enter all (or a portion) of its name, unique identifier (UID), or type in the Search box.
  • To filter by status, select All, Critical, Unhealthy, or Healthy next to the Search box.
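The Environment rules for Tetra Agents use the same 80% and 90% cut-offs for disk, memory, and CPU. The Python sketch below encodes them for illustration. Because the rules combine the metrics with "and/or", the sketch assumes that the worst single metric decides the Environment state; that combination rule and the function names are assumptions, not documented TDP behavior.

```python
# Ordering used to pick the most severe state (assumption: worst metric wins).
SEVERITY = {"Healthy": 0, "Unhealthy": 1, "Critical": 2}


def classify_percentage(used_pct: float) -> str:
    """Map one utilization percentage: <= 80 Healthy, > 80 and <= 90 Unhealthy, > 90 Critical."""
    if used_pct > 90:
        return "Critical"
    if used_pct > 80:
        return "Unhealthy"
    return "Healthy"


def classify_environment(disk_pct: float, memory_pct: float, cpu_pct: float) -> str:
    """Return the most severe of the three per-metric states."""
    states = [classify_percentage(pct) for pct in (disk_pct, memory_pct, cpu_pct)]
    return max(states, key=SEVERITY.__getitem__)
```

For example, `classify_environment(72, 85, 40)` returns `"Unhealthy"` because memory usage falls in the band above 80% and at or below 90%.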

This table describes the fields that display for each Tetra Agent:

Field

Description

Health

Status of the Tetra Agent.

Name

Name of the component instance.

To view additional details about the component (such as CPU, Memory, and Disk Usage), hover over the name. The information that displays depends on the component type. For example, connectors show memory usage and the last time that a status was received (last contact), whereas Tetra Agents show CPU, Memory, and Disk Usage statistics and last contact.

To copy the unique ID for the component instance, you can click the copy file icon.

Latest Status

When the latest status was assigned. To review a component's status history from the past month, click the View History link below the status.

Health Description

Explains why the component has been assigned its current state. By default, only one issue is shown. If there is more than one issue, then a link displays (for example, +1 More) indicating there are additional issues to review.

Link

Provides a link you can click to review the configuration details for the TDP component.

To review a Tetra Agent's status history over the past month, click the View History link in the Latest Status section. This table describes the status history information that displays:

Field

Description

Time

Time that the status was recorded for historical purposes.

Change

Status when a change occurred.

Errors

Errors or issues that indicate the reason for the status.

Warnings

Warnings that indicate the reason for the status. Typically, warnings display when the component changes from a Healthy state to an Unhealthy state.
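One way to picture how the history table is built is to take the status samples recorded for a component over the past month and keep only the points where the state changes. The Python sketch below does that. `StatusSample` and `status_changes` are hypothetical names, "past month" is approximated as 30 days, and the Errors and Warnings columns are not modeled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class StatusSample:
    """Hypothetical status sample: when it was recorded and the state at that time."""
    time: datetime
    state: str  # "Healthy", "Unhealthy", or "Critical"


def status_changes(
    samples: List[StatusSample],
    now: datetime,
    window: timedelta = timedelta(days=30),  # assumption: "past month" is about 30 days
) -> List[StatusSample]:
    """Return the samples in the window whose state differs from the previous sample.

    The oldest sample in the window is always included as the starting point.
    """
    recent = sorted((s for s in samples if now - s.time <= window), key=lambda s: s.time)
    changes: List[StatusSample] = []
    previous: Optional[str] = None
    for sample in recent:
        if sample.state != previous:
            changes.append(sample)
            previous = sample.state
    return changes
```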

View the Datahub(s) Connectors Health Dashboard

The Datahub(s) Connectors Health Dashboard provides statistical information on the health of the data hubs and their connectors installed on the Tetra Data Platform.

From the Health Monitoring screen, click the Datahub(s) Connectors tab to view the Datahub(s) Connectors Health Dashboard:

Datahub(s) Connectors Health Dashboard

The aggregate status of Healthy, Unhealthy, and Critical DataHubs and their associated connectors is displayed as graphics at the top of the screen. A DataHub and its connectors can be in these states:

State

DataHub or DataHub Connector

Event

Healthy

A DataHub is in a Healthy state when:

  • Online: The last status was received 3 or fewer minutes ago.
  • Environment: Percentage of disk space used is less than or equal to 80%, the percentage of memory used is less than or equal to 80%, and/or CPU usage is less than or equal to 80%.

A DataHub Connector is in a Healthy state when:

  • Online: The last status was received 3 or fewer minutes ago.
  • Environment: Percentage of memory used is less than or equal to 80%.

Unhealthy

A DataHub is in an Unhealthy state when:

  • Online: The last status was received more than 3 but less than 5 minutes ago.
  • Environment: Percentage of disk space used is greater than 80% but less than or equal to 90%, the percentage of memory used is greater than 80% but less than or equal to 90%, and/or CPU usage is greater than 80% but less than or equal to 90%.

A DataHub Connector is in an Unhealthy state when:

  • Online: The last status was received more than 3 minutes but no more than 5 minutes ago.
  • Environment: Memory usage is greater than 80% but less than or equal to 90% for the past 5 contiguous minutes.

Critical

A DataHub is in a Critical state when:

  • Online: A status has not been received for more than 5 minutes.
  • Environment: Percentage of disk space used is greater than 90%, the percentage of memory used is greater than 90%, and CPU usage is greater than 90%.

A DataHub Connector is in a Critical state when:

  • Online: A status has not been received in the past 5 minutes.
  • Environment: Percentage of memory used is greater than 90% for the past 5 contiguous minutes.

To find specific DataHubs or DataHub Connectors:

  • To search for a component, enter all (or a portion) of the DataHub's name or unique identifier (UID) in the Search box.
  • To filter by status, select All, Critical, Unhealthy, or Healthy next to the Search box.
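The DataHub and DataHub Connector rules combine a status-age check with a memory check that must hold "for the past 5 contiguous minutes". The Python sketch below illustrates both. It is not TDP code: the function names are hypothetical, the boundary at exactly 5 minutes follows the connector wording (Unhealthy up to and including 5 minutes), and "contiguous" is approximated as every memory sample reported in the last 5 minutes exceeding the threshold.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def classify_status_age(last_status: datetime, now: datetime) -> str:
    """Online check: <= 3 minutes Healthy, > 3 and <= 5 minutes Unhealthy, > 5 minutes Critical."""
    age = now - last_status
    if age <= timedelta(minutes=3):
        return "Healthy"
    if age <= timedelta(minutes=5):
        return "Unhealthy"
    return "Critical"


def classify_sustained_memory(
    samples: List[Tuple[datetime, float]],  # (timestamp, memory used %) pairs
    now: datetime,
    window: timedelta = timedelta(minutes=5),
) -> str:
    """Connector Environment check: a threshold counts only if every sample in the window exceeds it."""
    recent = [pct for ts, pct in samples if now - ts <= window]
    if not recent:
        # No recent samples: staleness is left to the Online check.
        return "Healthy"
    if all(pct > 90 for pct in recent):
        return "Critical"
    if all(pct > 80 for pct in recent):
        return "Unhealthy"
    return "Healthy"
```

A single spike above 90% therefore does not make a connector Critical in this sketch; the whole 5-minute window has to stay above the threshold.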

This table describes the health details for the individual DataHub or DataHub Connector:

Field

Description

Health

Status of the DataHub or DataHub Connector.

Name

Name of the component instance.

To view additional details about the component (such as CPU, Memory, and Disk Usage), hover over the name. The information that displays depends on the component type. For example, connectors show memory usage and the last time that a status was received (last contact), whereas Tetra Agents show CPU, Memory, and Disk Usage statistics and last contact.

To copy the unique ID for the component instance, you can click the copy file icon.

Latest Status

When the latest status was assigned. To review a component's status history from the past month, click the View History link below the status.

Health Description

Explains why the component has been assigned its current state. By default, only one issue is shown. If there is more than one issue, then a link displays (for example, +1 More) indicating there are additional issues to review.

Link

Provides a link you can click to review the configuration details for the TDP component.

View the Cloud Connectors Health Dashboard

The Cloud Connectors Health Dashboard provides statistical information on the health of the cloud connectors.

From the Health Monitoring screen, click the Cloud Connectors tab to view the Cloud Connectors Health Dashboard:

Cloud Connectors Health Dashboard

The aggregate status of Healthy, Unhealthy, and Critical cloud connectors is displayed as a graphic at the top of the screen. A cloud connector can be in these states:

State

Event

Healthy

A Cloud Connector is in a Healthy state when:

  • Online: The last status was received 3 or fewer minutes ago.
  • Environment: Percentage of memory used is less than or equal to 80%.

Unhealthy

A Cloud Connector is in an Unhealthy state when:

  • Waiting/Processing Time: Idle integrations have a waiting time of 5 times the polling interval and/or active integrations have a processing time that matches the polling interval.

Critical

A Cloud Connector is in a Critical state when:

  • Connection: A connection cannot be made after 3 consecutive attempts.
  • Waiting/Processing Time: Idle integrations have a waiting time of more than 5 times the polling interval and/or active integrations have a processing time that exceeds the polling interval.

To find specific Cloud Connectors:

  • To search for a component, enter all (or a portion) of the Cloud Connector name or unique identifier (UID) in the Search box.
  • To filter by status, select All, Critical, Unhealthy, or Healthy next to the Search box.
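The Cloud Connector timing rules compare waiting and processing times against the polling interval. The Python sketch below is one possible reading, not documented TDP logic: it assumes that "matches" means the time has reached the threshold (Unhealthy), that going beyond it is Critical, and that three consecutive failed connection attempts are Critical. The function names are hypothetical.

```python
from typing import Optional


def classify_timing(
    polling_interval_s: float,
    idle_wait_s: Optional[float] = None,  # waiting time of an idle integration
    active_processing_s: Optional[float] = None,  # processing time of an active integration
) -> str:
    """Compare waiting/processing times with the polling interval."""
    idle_limit = 5 * polling_interval_s
    if (idle_wait_s is not None and idle_wait_s > idle_limit) or (
        active_processing_s is not None and active_processing_s > polling_interval_s
    ):
        return "Critical"
    # Assumption: reaching a limit exactly is treated as "matching" it.
    if (idle_wait_s is not None and idle_wait_s >= idle_limit) or (
        active_processing_s is not None and active_processing_s >= polling_interval_s
    ):
        return "Unhealthy"
    return "Healthy"


def classify_connection(consecutive_failed_attempts: int) -> str:
    """Critical once 3 consecutive connection attempts have failed."""
    return "Critical" if consecutive_failed_attempts >= 3 else "Healthy"
```

Read literally, the Unhealthy band in this sketch is only the exact threshold, so treat it as an illustration of the comparisons rather than the product's precise behavior.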

This table describes the health details for the individual Cloud Connector:

Field

Description

Health

Status for the Cloud Connector.

Name

Name of the component instance. To copy the unique ID for the component instance, you can click the copy file icon.

Latest Status

When the latest status was assigned. To review a component's status history from the past month, click the View History link below the status.

Health Description

Explains why the component has been assigned its current state. By default, only one issue is shown. If there is more than one issue, then a link displays (for example, +1 More) indicating there are additional issues to review.

Link

Provides a link you can click to review the configuration details for the TDP component.

View the Pipelines Health Dashboard

The Pipelines Health Dashboard provides statistical information on the health of the pipelines that are part of the Tetra Data Platform.

From the Health Monitoring screen, click the Pipelines tab to view the Pipelines Health Dashboard:

Pipelines Health Dashboard

The aggregate status of Healthy, Unhealthy, and Critical TDP pipelines is displayed as a graphic at the top of the screen. A pipeline can be in these states:

State

Event

Healthy

A pipeline is in a Healthy state when:

  • Failures: Less than 20% of the workflows have failed in the past 24 hours.
  • Run Time: Run time is less than one standard deviation from the average run time for the last 5 runs. For example, if the mean of the past 5 runs is 32 seconds and the standard deviation is 5.7 seconds, then the run time must be between 26.3 and 37.7 seconds to be considered healthy.

Unhealthy

A pipeline is in an Unhealthy state when:

  • Failures: 20% or more, but less than 60%, of the workflows have failed in the past 24 hours.
  • Run Time: Run time is more than one standard deviation but less than two standard deviations from the average run time for the last 5 runs.

Critical

A pipeline is in a Critical state when:

  • Failures: All pipelines have failed in the past hour and/or more than 60% of the workflows have failed in the past 24 hours.
  • Run Time: Run time is more than two standard deviations from the average run time for the last 5 runs.

To find specific pipelines:

  • To search for a pipeline, enter all (or a portion) of the pipeline name or unique identifier (UID) in the Search box.
  • To filter by status, select All, Critical, Unhealthy, or Healthy next to the Search box.
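The run-time and failure rules are easiest to see with numbers. The Python sketch below computes the mean and standard deviation of recent runs, classifies a run by how far it lies from the mean, and classifies the 24-hour workflow failure rate. It is an illustration, not TDP code: the function names are hypothetical, it uses the sample standard deviation (`statistics.stdev`) because the source does not say which kind is used, values exactly on a boundary are assigned to the worse state, and the "failed in the past hour" condition is not modeled.

```python
import statistics
from typing import List, Tuple


def run_time_stats(last_runs: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of the most recent runs (the rules use the last 5)."""
    return statistics.mean(last_runs), statistics.stdev(last_runs)


def healthy_band(mean: float, sd: float) -> Tuple[float, float]:
    """Run times strictly inside this band are within one standard deviation of the mean."""
    return mean - sd, mean + sd


def classify_run_time(run_time: float, mean: float, sd: float) -> str:
    """Less than 1 SD from the mean Healthy, less than 2 SD Unhealthy, otherwise Critical."""
    deviation = abs(run_time - mean)
    if deviation < sd:
        return "Healthy"
    if deviation < 2 * sd:
        return "Unhealthy"
    return "Critical"


def classify_failure_rate(failed: int, total: int) -> str:
    """Share of failed workflows in the past 24 hours: < 20% Healthy, < 60% Unhealthy, else Critical."""
    if total == 0:
        return "Healthy"  # assumption: no workflows means nothing has failed
    rate = failed / total
    if rate < 0.20:
        return "Healthy"
    if rate < 0.60:
        return "Unhealthy"
    return "Critical"
```

With the documented example (mean 32 seconds, standard deviation 5.7 seconds), `healthy_band(32, 5.7)` gives the 26.3 to 37.7 second range; a 30-second run is Healthy, a 40-second run is Unhealthy (8 seconds from the mean, which is less than 11.4), and a 45-second run is Critical.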

This table describes the health details for the individual Pipeline:

Field

Description

Health

Status for the Pipeline.

Name

Name of the component instance. To copy the unique ID for the component instance, you can click the copy file icon.

Latest Status

When the latest status was assigned. To review a component's status history from the past month, click the View History link below the status.

Average Runtime

Average runtime for the past five pipeline runs.

Health Description

Explains why the component has been assigned its current state. By default, only one issue is shown. If there is more than one issue, then a link displays (for example, +1 More) indicating there are additional issues to review.

Link

Provides a link you can click to review the configuration details for the TDP component.

View the Files Health Dashboard

The Files Health Dashboard provides statistical information on the health of the files that are part of the Tetra Data Platform.

From the Health Monitoring screen, click the Files tab to view the Files Health Dashboard:

Files Health Dashboard

The File Statistics section provides an aggregate count of:

  • Files Uploaded: Number of files that were uploaded by all members of the organization.
  • Workflows Triggered: Number of workflows that were launched.
  • Files Indexed: Number of files that were indexed.
  • Files in Athena: Number of files that are in Athena; you can use SQL to query Athena files.
  • Files Failed Indexing: Number of files that have failed to index.

The Overall File Health section indicates the overall file health status: Healthy, Unhealthy, or Critical.

  • To search for a file, enter all (or a portion) of the file name in the Search box.
  • To view files that have been uploaded to the data lake (S3 bucket), click the From and To fields to select a date range.
  • To view files that have been indexed, click the From and To fields to select a date range.
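The Search box and the two date-range filters can be thought of as simple predicates over file records. The Python sketch below shows that idea under stated assumptions: `FileRecord`, `search_by_name`, and `files_in_range` are hypothetical, the name search is assumed to be a case-insensitive substring match, and the date range is assumed to include both ends.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class FileRecord:
    """Hypothetical row in the Files list."""
    name: str
    uploaded: datetime
    indexed: Optional[datetime]  # None if the file has not been indexed
    source: str


def search_by_name(files: List[FileRecord], text: str) -> List[FileRecord]:
    """Case-insensitive match on all or part of the file name."""
    needle = text.lower()
    return [f for f in files if needle in f.name.lower()]


def files_in_range(
    files: List[FileRecord], start: datetime, end: datetime, by: str = "uploaded"
) -> List[FileRecord]:
    """Files whose upload (by="uploaded") or index (by="indexed") time falls within [start, end]."""
    if by not in ("uploaded", "indexed"):
        raise ValueError("by must be 'uploaded' or 'indexed'")
    selected = []
    for f in files:
        timestamp = getattr(f, by)
        if timestamp is not None and start <= timestamp <= end:
            selected.append(f)
    return selected
```

Files that have not been indexed are skipped by the indexed-date filter in this sketch, because they have no index timestamp to compare.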

This table describes the health details for the individual file:

Field

Description

Name

Name of the file.

Date Uploaded

Date and time the file was uploaded.

Date Indexed

Date and time the file was indexed.

Source

Source of the file.

Link

Unique file ID. To copy the unique file ID to the clipboard, click the copy file icon.

