This content, written by Ori Raphael & Roy Hegdish, was originally published on the Looker Blog on March 10, 2020. The content is subject to limited support.
Amazon Athena is one of the fastest-growing services in the AWS cloud, often used as part of a cloud data lake for ad-hoc querying, analytics, and data science on both structured and semi-structured data.
In this article, we’ll walk through a few key data preparation best practices that can have a major impact on query performance in Amazon Athena. We’ll also share benchmarks showing how applying these best practices with Upsolver’s data lake ETL platform can improve performance and deliver fresher, more up-to-date data in dashboards built on AWS Athena, all while reducing query costs.
To understand why Athena performance matters, let’s look at an example: an IT organization runs servers in multiple regions and needs to continuously monitor their performance for outages and for spikes in resource utilization that could inflate the company’s infrastructure costs.
Since their data consists of log files, we suggest a streaming architecture in which events are processed by Kinesis before being written to S3. When dealing with very high volumes of semi-structured data, and to avoid the complex and costly process of ETLing that data into a relational database, using S3 as the storage layer rather than a database such as Amazon Redshift makes the most business sense, since Athena allows you to query data directly from S3.
Looker is then used to visualize and analyze the results of the queries that are run in Athena. This enables users to view critical metrics for resource utilization across the various AWS regions where they have machines running.
The IT managers who use the Looker dashboard want it to refresh frequently to show up-to-date data and allow them to address issues in near-real time. In addition, they’re interested in reducing the total costs of running this dashboard.
To address this, let’s first cover some basics about Athena and the unique characteristics of ETL for Athena. From there we can solve the challenge using data preparation for Athena.
One of the main advantages of Athena compared to a traditional database such as Redshift is that it is offered as a completely managed service: there are no servers to manage, and AWS automatically provisions resources as needed to execute any given query.
You can then use business intelligence tools, such as Looker, to visualize and explore the data. Since there is no infrastructure to manage, Athena can be used as the basis for quickly launching new analytical dashboards and applications.
If you just need an answer to a single query and don’t care too much about cost or speed, you can run Athena without any kind of ETL. It will generally return results, even if not in the most efficient way.
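For instance, pointing Athena at raw CSV logs requires little more than an external table definition over the relevant S3 path. Here’s a minimal sketch, assuming a hypothetical bucket path and the column names that appear in the queries later in this article:

```sql
-- Hypothetical external table over raw CSV logs on S3.
-- The bucket path and column types are assumptions for illustration.
CREATE EXTERNAL TABLE demo.server_usage_raw (
  `timestamp` bigint,     -- epoch seconds; backticked because TIMESTAMP is reserved in DDL
  tags_host string,
  tags_aws_region string,
  tags_cluster_id string,
  fields_n_cpus int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-bucket/raw/server-usage/';
```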
However, things get tricky when you need Athena to perform consistently at scale. This can be critical for BI data flows, as in our example, especially when you expect your dashboard to refresh frequently to reflect changes in real or near-real time. In these cases, where performance and cost are important factors, you will need to prepare the data before querying it in Athena. In short: we need ETL.
Because Athena is a compute engine rather than a database, ETL for Athena is different than database ETL. Since we don’t have things like indexes, upserts, or delete APIs, we’ll need to do the ETL separately over the data stored on S3.
ETL for Athena can be done using Apache Spark running on Amazon EMR or similar solutions. In this case, we will be using Upsolver to ingest data to S3 and continuously optimize it for querying in Athena (as well as in other databases and analytics tools such as Redshift and Redshift Spectrum).
While there are many best practices you can use to improve Athena performance, we’ll focus on the ones that give us the most ‘bang for our buck’ and can produce dramatic improvements in query performance and cost. We’ll then go through benchmarks that quantify those improvements. Upsolver applies these data preparation best practices automatically as data is ingested and written to S3, but you could code a similar solution in Spark manually if you have the requisite expertise in Scala and the time to continuously maintain the pipelines.
Returning to our initial reference architecture: data from the various servers is streamed via Amazon Kinesis and written to S3 as raw CSV files, with each file representing a single log. To improve the performance of our Athena queries, and of the Looker dashboard that sits on top of them, we’ll apply the following data preparation techniques: compaction, partitioning, conversion to columnar storage (Apache Parquet), and compression.
Athena is priced at $5 per terabyte of data scanned, so the most direct way to reduce Athena costs is to reduce the amount of data each query scans. Upsolver uses compaction, partitioning, columnar storage, and compression to achieve this. In the benchmarks below, the reduction in the amount of data scanned translates to an equivalent reduction in cost.
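To make this concrete, here is a rough sketch of what such an optimization pass could look like as an Athena CTAS statement, rewriting the raw CSV table as partitioned, Snappy-compressed Parquet. Upsolver performs this work continuously on ingest; the table names, bucket path, and epoch-seconds timestamp here are assumptions for illustration:

```sql
-- Hypothetical CTAS: rewrite raw CSV as Snappy-compressed Parquet,
-- partitioned by day. Names and paths are illustrative assumptions.
CREATE TABLE demo.server_usage_parquet
WITH (
  format = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://example-bucket/optimized/server-usage/',
  partitioned_by = ARRAY['dt']
) AS
SELECT
  timestamp,
  tags_host,
  tags_aws_region,
  tags_cluster_id,
  fields_n_cpus,
  -- derive a daily partition key; partition columns must come last
  date_format(from_unixtime(timestamp), '%Y-%m-%d') AS dt
FROM demo.server_usage_raw;
```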
In order to understand how each of these factors affects the performance of our queries, we’ll look at some “before and after” benchmarks for each of the queries that power the resource utilization dashboard.
Running on raw CSV, Athena queries returned in 12 to 18 seconds. This improved to 2.75 to 5.47 seconds on the most optimized version of the data. On average, queries against the most optimized version of the data returned 3.8 times faster than queries against the raw CSV.
Optimizing the data had a significant impact on the amount of data scanned, which translates directly into cost savings. The largest change came from converting CSV to Parquet, which cut the amount of data scanned from gigabytes to a few dozen megabytes for most queries; re-aggregating the data further reduced this figure to less than 1 megabyte.
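To give a sense of what that re-aggregation might look like, here is a hedged Athena CTAS sketch that rolls raw events up to one row per timestamp, region, host, and cluster. Upsolver performs the equivalent aggregation at ingest time; the table names are illustrative assumptions:

```sql
-- Hypothetical pre-aggregation: one row per timestamp/region/host/cluster,
-- so dashboard queries scan far less data. Table names are assumptions.
CREATE TABLE demo.server_usage_aggregated
WITH (format = 'PARQUET', write_compression = 'SNAPPY') AS
SELECT
  timestamp,
  tags_aws_region,
  tags_host,
  tags_cluster_id,
  sum(fields_n_cpus) AS fields_n_cpus
FROM demo.server_usage_parquet
GROUP BY timestamp, tags_aws_region, tags_host, tags_cluster_id;
```

The grouping keys are chosen so that the dashboard queries below (distinct hosts, CPU sums, per-region breakdowns) can still be answered from the rolled-up table.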
Below are the SQL queries that power the resource utilization dashboard, along with the performance Athena achieved when running each query against raw CSV, Apache Parquet, compacted Parquet, and pre-aggregated data.
In all cases except the raw CSV, the Upsolver platform partitioned the data by time and compressed it using Snappy.
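Partitioning by time pays off whenever a query filters on the partition column, since Athena then reads only the matching partitions. For example, assuming the daily dt partition column from the sketches above (an assumption, not part of the benchmark queries below):

```sql
-- Hypothetical: filtering on the assumed partition column 'dt' lets
-- Athena prune every partition outside that day before scanning.
SELECT timestamp, count(DISTINCT tags_host)
FROM demo.server_usage_reduced
WHERE dt = '2020-03-01'
GROUP BY timestamp;
```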
This is the original SQL we ran in Athena:
```sql
SELECT timestamp, count(DISTINCT tags_host)
FROM demo.server_usage_reduced
GROUP BY timestamp;
```
These are the results after optimizing the data:
| | CSV | Parquet | Parquet + Compaction | Aggregated |
| --- | --- | --- | --- | --- |
| Query time (seconds) | 16.15 | 11.42 | 6.06 | 5.47 |
| Data scanned (MB) | 27,294.52 | 220.06 | 220.06 | 0.84 |
The SQL we ran in Athena:
```sql
SELECT timestamp, sum(fields_n_cpus) / 60
FROM demo.server_usage_reduced
GROUP BY timestamp;
```
Results after optimizing the data:
| | CSV | Parquet | Parquet + Compaction | Aggregated |
| --- | --- | --- | --- | --- |
| Query time (seconds) | 13.72 | 7.63 | 5.56 | 3.96 |
| Data scanned (MB) | 25,440 | 21.03 | 21.03 | 0.56 |
The SQL we ran in Athena:

```sql
SELECT timestamp, tags_aws_region, count(DISTINCT tags_host)
FROM demo.server_usage_reduced
GROUP BY timestamp, tags_aws_region;
```
Results after optimizing the data:
| | CSV | Parquet | Parquet + Compaction | Aggregated |
| --- | --- | --- | --- | --- |
| Query time (seconds) | 18.13 | 8.41 | 4.95 | 2.75 |
| Data scanned (MB) | 25,450 | 226.34 | 226.34 | 0.78 |
The SQL we ran in Athena:

```sql
SELECT timestamp, tags_aws_region, sum(fields_n_cpus) / 60
FROM demo.server_usage_reduced
GROUP BY timestamp, tags_aws_region;
```
Results after optimizing the data:
| | CSV | Parquet | Parquet + Compaction | Aggregated |
| --- | --- | --- | --- | --- |
| Query time (seconds) | 17.33 | 9.04 | 5.84 | 4.54 |
| Data scanned (MB) | 25,460 | 27.12 | 27.12 | 0.57 |
The SQL we ran in Athena:

```sql
SELECT tags_aws_region, count(DISTINCT tags_cluster_id)
FROM demo.server_usage_reduced
GROUP BY tags_aws_region;
```
Results after optimizing the data:
| | CSV | Parquet | Parquet + Compaction | Aggregated |
| --- | --- | --- | --- | --- |
| Query time (seconds) | 16.23 | 8.09 | 5.82 | 3.76 |
| Data scanned (MB) | 25,480 | 226.46 | 226.46 | 0.81 |
The SQL we ran in Athena:

```sql
SELECT tags_aws_region, sum(fields_n_cpus) / 60
FROM demo.server_usage_reduced
GROUP BY tags_aws_region;
```
Results after optimizing the data:
| | CSV | Parquet | Parquet + Compaction | Aggregated |
| --- | --- | --- | --- | --- |
| Query time (seconds) | 12.37 | 8.16 | 5.02 | 4.2 |
| Data scanned (MB) | 25,490 | 27.15 | 27.15 | 0.57 |
Want to unlock the value of Looker and Upsolver? Visit Upsolver to learn more.