This content, written by Erin Franz, was initially posted in Looker Blog on Jan 19, 2017. The content is subject to limited support.
Today we are proud to announce that, with our newest release, we now support Amazon Athena. Looker on Amazon Athena allows users across an organization to derive insights and easily make data-driven decisions directly from their AWS data lake.
So what is a data lake, really? A data lake is a single repository for ALL of an organization’s data, regardless of source or format. Structured, semi-structured, and unstructured data can all be stored in the same place. Data lakes can include data that you’re using today, data that you plan to use in the future, and even data with an as-yet-unknown purpose that you might find a use for someday. Ideally, all data for all time is stored in one place so the entirety of your historical data is available for analysis. With all data available, theoretically, any question can be answered with data to serve all of an organization’s data consumers.
Historically, commodity servers combined with cheap storage made storing data, from terabyte to petabyte scale, economical. Amazon offers similar low-cost, scalable storage in the cloud via S3, which makes it easy to store and retrieve any amount of data from any source. In theory, data lake solutions like S3 sound ideal.
However, while storing the data is relatively easy, analyzing it has been a different story. Analysis on data stored in data lakes comes with some significant challenges. For instance, a common solution is to set up a Hadoop cluster, which is difficult and requires resources with specific skills to build and maintain. Another solution to analyze data in a data lake is to ETL (extract, transform, load) the data to a data warehouse, but oftentimes this leaves data consumers with only subsets or aggregates of the data lake’s information.
Difficult access and the inability to query all the data in a data lake have left the value of analytics on S3 out of reach for many. Enter Amazon Athena.
Amazon Athena is an interactive query service that makes it easy to analyze data in S3 using standard SQL. This is how Amazon Athena has tackled existing problems with analyzing data in S3:
Athena is a managed service. That means no infrastructure or administration is required. Queries are tuned for performance and automatically executed in parallel, and pricing follows a cost-per-query model.
No ETL is required. Data can be queried directly where it lives in S3. Athena utilizes schema-on-read and can leverage files in diverse formats like txt, csv, JSON, weblogs, and even AWS service logs. ORC and Parquet formats can be used for even better performance.
Complex queries at any granularity are made possible via SQL. Athena uses Presto as its SQL query engine. Complex joins, nested queries, window functions, complex data types like arrays and structs, and partitioning by any key are all supported, so you can query your most granular data in S3.
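To make the last two points concrete, here is a minimal sketch in Python that only builds SQL text; it does not connect to Athena. It assumes a hypothetical table `weblogs.events` stored as Parquet under a made-up S3 path and partitioned by a `dt` string column. The table name, column names, bucket, and both helper functions are illustrative assumptions, not part of Looker or Athena. The first statement declares a schema over files that stay in S3 (schema-on-read), and the second uses a window function and an array column in Presto-style SQL.

```python
from typing import Dict


def build_external_table_ddl(table: str, columns: Dict[str, str],
                             partition_column: str, s3_location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement over Parquet files in S3.

    The files are not moved or loaded; the statement only describes how
    to read them. All names passed in are hypothetical examples.
    """
    column_defs = ",\n  ".join(
        f"{name} {sql_type}" for name, sql_type in columns.items()
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {column_defs}\n"
        f")\n"
        f"PARTITIONED BY ({partition_column} string)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{s3_location}'"
    )


def build_event_sequence_query(table: str, partition_column: str,
                               partition_value: str) -> str:
    """Build a query that numbers each user's items in time order.

    UNNEST expands the assumed `items` array column into one row per item,
    and row_number() is a window function evaluated per user. The WHERE
    clause filters on the partition column so only that partition is read.
    Values are interpolated directly, so this is for trusted input only.
    """
    return (
        "SELECT user_id, event_time, item,\n"
        "       row_number() OVER (\n"
        "         PARTITION BY user_id ORDER BY event_time, item\n"
        "       ) AS event_seq\n"
        f"FROM {table}\n"
        "CROSS JOIN UNNEST(items) AS t(item)\n"
        f"WHERE {partition_column} = '{partition_value}'"
    )


if __name__ == "__main__":
    # Hypothetical schema and location, chosen only for illustration.
    ddl = build_external_table_ddl(
        "weblogs.events",
        {"user_id": "string", "event_time": "timestamp", "items": "array<string>"},
        "dt",
        "s3://example-bucket/events/",
    )
    print(ddl)
    print()
    print(build_event_sequence_query("weblogs.events", "dt", "2017-01-19"))
```

The generated statements could then be run through whatever client you use to submit queries to Athena. A partitioned table also needs its partitions registered before the `dt` filter returns rows, which this sketch leaves out.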
Looker took an early bet on SQL as the lingua franca for data analysis. We developed a product that directly leverages the underlying power and functionality of SQL dialects, and it already has full support for AWS data sources including RDS, Redshift, and EMR via Spark SQL, Hive and Presto. Now, either in conjunction with these engines or separately, you can leverage Looker on Athena to make data in S3 available across your organization. Looker doesn’t move your data from S3; it directly leverages the power of Athena to query the data where it lives.
Through a centralized LookML data model, all users, regardless of SQL ability, can access this data in a governed way that enables them to act on and share insights in a modern web environment. With Looker, anyone in your organization can explore, analyze and take action on the data. The Looker application’s native UI offers highly flexible visualization capabilities, but it doesn’t end there.
Looker is a complete data platform: it can be embedded in third-party applications like Salesforce, used to create custom web apps as part of a product offering, or even accessed via the API to visualize results in messaging tools like Slack. Looker can further complete the loop by visualizing data that has been collected from these tools and stored in S3.
Look for Amazon Athena support in our upcoming Looker release. If you’re not already exploring your data with Looker, we’d love to hear from you!