JDBC and SQL via AWS Athena

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. With a few actions in the AWS Management Console, you can point Athena at your data stored in Amazon S3 and begin using standard SQL to run ad-hoc queries and get results in seconds.

Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run. Athena scales automatically—executing queries in parallel—so results are fast, even with large datasets and complex queries.

TetraScience has designed its Data Lake using AWS S3. TetraScience's Data Connectors and Data Pipelines collect data from instruments, CROs, and enterprise databases, standardize the data into TetraScience's Intermediate Data Schema (IDS), and prepare the data so that it is ready to be queried via Athena.

Connect to Amazon Athena

Amazon Athena provides a JDBC driver for connections. Any tool that can use a JDBC driver can connect to Amazon Athena (SQL Workbench, IntelliJ IDEA, and so on). Java can also be used to access Athena programmatically. To find the JDBC driver download address and JDBC usage details, click here.

📘

JDBC Driver Version

Depending on your Java version (JDK 8.0 or JDK 9.0), you must use a different JDBC driver version!
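
As a minimal sketch, a plain JDBC connection to Athena looks like the following (all values in angle brackets are placeholders; substitute the username, password, region, and S3OutputLocation described in step 1 below):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class AthenaJdbcExample {
    public static void main(String[] args) throws Exception {
        // The region in the URL is an example; use the region from your connection details
        String url = "jdbc:awsathena://AwsRegion=us-east-2;";

        Properties props = new Properties();
        props.put("User", "<username>");       // Athena username from the platform
        props.put("Password", "<password>");   // Athena password from the platform
        // Required: the S3 location where Athena writes query results
        props.put("S3OutputLocation", "s3://<bucket>/<prefix>");

        try (Connection conn = DriverManager.getConnection(url, props);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1));
            }
        }
    }
}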

If you change the version of the IDS that you are using in a pipeline, new version-specific tables are created that include the processing results. If you query the results, you may see duplicate information from results generated by the old and new versions of the IDS.

To avoid duplicate results and return a distinct set of values, TetraScience recommends that you use the UNION clause. The UNION clause enables you to combine two separate SELECT statements to produce a result set as a union of both SELECT statements. To use the UNION clause:

  • The fields used in both SELECT statements must be the same in number and order, and must have the same data types.
  • The UNION clause produces distinct values in the result set.
SELECT column_name(s) FROM lcuv_empower_v8_injection UNION SELECT column_name(s) FROM lcuv_empower_v9_injection;

The result set consists of distinct values,

where:

lcuv_empower_v8_injection = the table generated by IDS version 8

lcuv_empower_v9_injection = the table generated by IDS version 9

1. Obtain Connection Details from the Tetra Data Platform

Connection details for the Amazon Athena instance used by the Tetra Data Platform can be easily obtained from the platform. From the account settings menu in the upper-right corner of the screen, select the Account menu item:

Then, from the left menu, select the Data menu item:

The system displays AWS user credentials created specifically for your organization. The username and password are displayed. Note that the password appears only once, when the organization is first created; after that, it is hidden. Therefore, store the password for later use.

Credentials can be re-created by clicking the Replace Credentials button. Clicking this button generates and displays a new username and password. Be aware that the new credentials replace the old ones, which will no longer be usable.

Under the username and password sections, the URL address for Athena is displayed, together with a required connection parameter named S3OutputLocation.

Write down the username, password, URL, and S3OutputLocation; you will need them to connect to the Amazon Athena instance used by the platform and query the data.

Under the connection settings, the system displays all the tables in the Athena tablespace that belong to your organization:

If you are self-hosting the platform in your own AWS account and have set the deployment parameter AthenaCreateIamUser to false, then the credentials are not available in the platform UI (as described previously).
To obtain the JDBC credentials:

  1. Locate the TDP orgSlug in the platform UI.
  2. In AWS IAM, locate a policy named: ts-platform-<awsRegion>-production-<orgSlug>-policy.
  3. Create an IAM user or role in the AWS account where TDP is hosted and attach the policy to it.
  4. Use the credentials of the role/user to access Athena by following the AWS documentation located here.
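
If the credentials come from an IAM user or role as described above, the Simba Athena driver can also resolve them through the AWS SDK's default provider chain instead of a static username/password pair. A minimal sketch (region and output location are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class AthenaRoleConnection {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Let the driver pick up credentials from the environment,
        // instance profile, or credentials file
        props.put("AwsCredentialsProviderClass",
                "com.simba.athena.amazonaws.auth.DefaultAWSCredentialsProviderChain");
        props.put("S3OutputLocation", "s3://<bucket>/<prefix>");

        try (Connection conn = DriverManager.getConnection(
                "jdbc:awsathena://AwsRegion=<awsRegion>;", props)) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}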

2. Connect to Athena from IDE Tools (JetBrains DataGrip)

Add the JDBC driver for Athena in DataGrip.
a) Once DataGrip is launched, click the Add button (+ sign).

b) Click Driver and name your driver. In this example, the driver is named “Athena - Demo”.
c) Add the Athena driver file downloaded in step 1 by clicking the + sign below the Driver files box, selecting “Custom JARs”, navigating to the downloaded file, and uploading it.
d) Once the driver file is uploaded, you can select the correct class: “com.simba.athena.jdbc42.Driver”.

e) Under Options, set the Dialect to “Generic SQL”.

f) Under Advanced, fill in “AwsRegion” and “S3OutputLocation”. The AWS Region is the region where your Athena bucket is located; in this example, it is “us-east-2”. You obtained the “S3OutputLocation” value in step 1.

g) Once you have finished filling in those fields, click “Apply”, then “OK”.

Add a new Data Source.
a) Click the Add button (+ sign). Select “Data Source” and find the driver you created.

b) Fill in the User, Password, and URL fields using the data obtained in step 1. Click Test Connection. Once you are connected (indicated by a green checkmark), click OK.

c) Now you are ready to access your data using DataGrip’s console.

3. Connect to Athena from Java Spring

To connect to Athena from a Spring application, a DataSource must be configured. Once the DataSource is configured, it can be used in combination with JdbcTemplate like any other JDBC data source.

Example of Java-based Spring configuration

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.env.Environment;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

import javax.sql.DataSource;
import java.util.Properties;

@Configuration
public class ApplicationConfig {
    @Autowired
    private Environment env;

    @Bean
    public DataSource dataSource(){
        DriverManagerDataSource dataSource = new DriverManagerDataSource();
        dataSource.setDriverClassName("com.simba.athena.jdbc42.Driver");
        dataSource.setUrl(env.getProperty("jdbc.url"));
        dataSource.setUsername(env.getProperty("jdbc.username"));
        dataSource.setPassword(env.getProperty("jdbc.password"));
        // Without this property, Athena will not work
        Properties props = new Properties();
        props.put("S3OutputLocation",env.getProperty("jdbc.S3OutputLocation"));
        dataSource.setConnectionProperties(props);

        return dataSource;
    }
}
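
Once the DataSource bean is defined, it can be queried through JdbcTemplate. The sketch below is illustrative (the repository name is hypothetical, and the table lcuv_empower_v8_injection is reused from the UNION example above):

import java.util.List;
import java.util.Map;

import javax.sql.DataSource;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Repository;

@Repository
public class InjectionRepository {

    private final JdbcTemplate jdbcTemplate;

    public InjectionRepository(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    // Fetch a small sample of rows from an IDS table as generic maps
    public List<Map<String, Object>> sampleInjections() {
        return jdbcTemplate.queryForList(
                "SELECT * FROM lcuv_empower_v8_injection LIMIT 10");
    }
}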

4. Connect to Athena from TIBCO Spotfire

Before connecting Spotfire to Amazon Athena, the Athena JDBC driver must be deployed to TIBCO Spotfire. Download the JDBC driver from the Amazon Athena page for the appropriate Java version, deploy the downloaded JDBC JAR file to Spotfire's Tomcat, restart the Spotfire server, and configure a Data Source Template. All of these steps are described in the Spotfire documentation at the following URL:

https://community.tibco.com/wiki/tibco-spotfire-jdbc-data-access-connectivity-details#toc-11

Once the TIBCO Spotfire server is configured, follow the steps below in the TIBCO Spotfire client to connect to Amazon Athena and use it as a data source.

  1. In the Spotfire Client, from the Tools menu, select the Information Designer item.
  2. Select the Data Sources tab and, on the right side, click Setup Data Source.
  3. Provide the connection details obtained from the Tetra Data Platform. Note that the connection URL also contains parameters and should be in the following form:

jdbc:awsathena://AwsRegion=us-east-2;S3OutputLocation=s3://ts-athena-results-us-east-2-staging/demo;athena.us-east-2.amazonaws.com:443

  4. When all fields are filled in, click Save As... and save the new Data Source at an appropriate location.

NOTE: If there is an error in the connection (wrong URL, username, or password, or a missing parameter), the system will not allow saving the data source and an error message will be displayed.

  5. The remaining steps give an example of how to create a simple visualization using Athena tables as a data source. The first step is to map table columns to Spotfire columns. We can either map columns one by one by clicking the Column link, or map multiple table columns at once by clicking Multiple Columns. In this scenario, we will map multiple columns at once.

Click Multiple Columns.

  6. The dialog changes, and on the left side we have to navigate to the Athena table we want to map to Spotfire. Select one column, multiple columns, or the whole table, and click the Add > button.
  7. Click the Create Columns... button. Spotfire will display a dialog asking where to save the mapped columns. For better organization, it is preferable to save the columns in a dedicated subfolder for the given project.
    Click Create Columns.
  8. The next step is to create an information link. This is the actual data source: the SQL query that is executed to fetch the data from Athena. Click Create Information Link.
  9. A new dialog is displayed. An information link accepts only mapped columns that are saved in the Spotfire library. Therefore, on the left side, click the Elements tab and navigate to the columns created in the previous step. Select the desired columns and add them to the right side by clicking the Add > button.
    This screen contains additional options, such as setting parameters, prompt values, join paths, filters, caching, and so on. Clicking the SQL... button displays the SQL statement that will be executed to fetch the data.

The SQL is editable and can be modified if needed. Close the SQL dialog and click the Save As... button.

  10. The system will ask for the location where the information link should be saved. Again, it is recommended to save the link at an appropriate location.
  11. The next step in creating a data visualization is to select Add Data Tables... from the File menu.
  12. The data table supports multiple options as a data source. In this case, we want to use the information link created previously. Therefore, click the Add button and select Information Link... from the menu.
  13. In the library, navigate to the location where the information link was saved in step 10.
  14. The system will provide several options for the data table. Click the OK button.

At this point, Spotfire will run the SQL and fetch the data from Athena. Depending on the source size and the SQL provided, the query can take some time.

  15. Once Spotfire has loaded the data from the source, it will show a dialog with visualization suggestions. The suggestions depend on the selected columns. We can either accept a suggested visualization, or click the Close button and add visualizations manually.
  16. All available visualizations can be found on the palette at the top.
  17. To begin, we can add a simple table. The table includes all the columns and is essentially the purest representation of what is in the underlying table.
  18. At this point, we can also add a Filter from the palette. The filter includes all columns, and the system recognizes text and numerical columns. For numerical columns, filters take the form of a range bar. For text columns, depending on the number of distinct values, the filter displays either checkboxes, a select list, or a simple edit box. Changing the filters automatically changes the values in the table and other visualizations.

5. Connect to Athena from Tableau

Refer to the URL below. Note that all the information necessary to connect can be found in step 1 of this documentation. You must download the driver and save it in the folder indicated in the blog before setting up the connection.

6. Connect to Athena Using Python

Below is a Python code snippet you can copy and paste into your Python environment and run. All you need to do is:

  1. Install the necessary Python modules:

pipenv install PyAthena
pipenv install pandas

  2. Replace everything in '<>' with the values from your account.

All parameters in the code snippet below, with the exception of schema_name, can be found in Account > Data > SQL access.

# import necessary modules
from pyathena import connect
import pandas as pd

# define connection
conn = connect(
    aws_access_key_id="<SQL access > Username>", # shared across the entire org
    aws_secret_access_key="<SQL access > Password>", # shared across the entire org
    region_name="<SQL access > Show additional connection parameters > URL > The string found after 'AwsRegion='>", # e.g. us-west-2
    schema_name="<Account > Organization > ORG SLUG>", # e.g. tetrascience
    s3_staging_dir="<SQL access > Show additional connection parameters > S3OutputLocation>" # should start with s3://
)

# define query
query = """
SELECT *
FROM <org slug>.<table name>;
"""

# fetch data
df = pd.read_sql_query(query, conn)

print(df.head())

Security

Every organization on the Tetra Data Platform is automatically provisioned with a separate AWS KMS (Key Management Service) key.

The query result is encrypted using SSE-KMS (the driver uses server-side encryption with an AWS KMS-managed key).

Additionally, the JDBC driver automatically applies SSL encryption to all connections. SSL encryption protects data and credentials when they are transferred over the network, and provides stronger security than authentication alone. Here, SSL refers to both TLS (Transport Layer Security) and SSL (Secure Sockets Layer). The driver supports industry-standard versions of TLS/SSL.

The TLS version that the driver supports depends on the JVM version installed. The TLS version used for the connection is the highest version supported by both the driver and the server.
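
Because the available TLS version depends on the JVM, it can be useful to list the protocol versions your Java runtime supports. A minimal sketch using the standard JSSE API:

import javax.net.ssl.SSLContext;

public class TlsVersions {
    public static void main(String[] args) throws Exception {
        // List the protocol versions the default SSL context supports,
        // e.g. TLSv1.2 and TLSv1.3 on recent JVMs
        SSLContext context = SSLContext.getDefault();
        for (String protocol : context.getSupportedSSLParameters().getProtocols()) {
            System.out.println(protocol);
        }
    }
}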

Reference:
Oracle Blog: Diagnosing TLS, SSL, and HTTPS

