Using Amazon Redshift Materialized Views with Look...

LookerExperts · ‎03-27-2022

This content, written by Bruce Sandell, was initially posted in Looker Blog on Jul 2, 2020. The content is subject to limited support.

Materialized Views (MVs) allow data analysts to store the results of a query as though it were a physical table. Amazon Redshift recently , providing a useful and valuable tool for data analysts, because they allow analysts to compute complex metrics at query time with data that has already been aggregated, which can drastically improve query performance.

Historically in Redshift, users have been able to store the results of a query to a physical table using the CREATE TABLE AS SELECT (CTAS) syntax. While this still allows users to query pre-aggregated data, MVs have an important benefit over CTAS tables. MVs can be updated incrementally using the REFRESH MATERIALIZED VIEW command, whereas a table created using CTAS has to be completely recreated each time. And because tables created using CTAS can include joins across large tables and complex aggregations, it may take several minutes for it to get recreated. The time it takes to refresh a MV, however, is typically measured in seconds or milliseconds, and it still provides support for all of the joins and complex aggregations.

Using AWS MVs with Looker

The increased query performance that can be achieved with MVs is highlighted in this AWS , where the authors demonstrate how MVs can be used to speed up analytic queries by pre-aggregating data. Using that blog as inspiration, I thought it would be interesting to show how to make use of MVs with Looker using that same data set.

In this post, I’ll run the setup script supplied in the AWS blog post, use Looker to run the queries against a standard database view and a MV, and evaluate just how well a MV built on top of more than 100 million rows performs.

Getting started

To begin this exercise, I first made sure that my timings were accurate. To do this, I did two things to ensure that I wasn’t working with cached data;

First, I disabled any session level caching for my Redshift user: alter user looker set enable_result_cache_for_session to off
and second, each time I visualized the data in Looker, I used the option to ‘clear cache and refresh.’

Configuring the LookML

After running through the setup script supplied in the AWS blog, I created a view as well as a MV that aggregated nearly 150 million rows of data from the Amazon Product Reviews dataset. From there, I set up a new Looker project and where I added in those two views. In the LookML for each of the views, I added a new measure for summing up the count of reviews in each grouping. Here’s the LookML that I added:


measure: total_cnt {
  type: sum
  sql: ${cnt} ;;
}

And with that additional measure in place, it was time to start comparing query results.

The results

The query summarizes the number of reviews by product category:


SELECT product_category,
    sum(cnt)
FROM v_reviews_byprod_and_state
GROUP BY 1
ORDER BY 2;

Running this in Looker against the database view took a little under three seconds, which is really impressive considering that the view joined and summarized data across 3 tables, the largest of which is almost 150 million rows.

When running the same query against the MV, it returned results after about 100 milliseconds — a full 28x faster than the standard view.

It’s important to note here that, as the data volume in the product reviews table grows, it’s reasonable to expect the query results returned by the standard view to take progressively longer. The performance of the MV, on the other hand, should not be affected by the size of the product reviews table, making the performance benefit of the MV even more evident as the size of the underlying table(s) grows.

Automatically refresh MVs with Looker

In Redshift, MVs are refreshed manually, using the REFRESH MATERIALIZED VIEWS statement. The potential drawback with this is that as new rows get added to the underlying tables that make up the MV, the MV will be out of sync with the base tables until the REFRESH command is issued.

A simple solution to this is using Persistent Derived Tables (PDTs), datagroups, and datagroup triggers in Looker. With these, Looker can use the result of a query (defined in a Looker datagroup) to determine when a PDT needs to be recreated, or for this exercise, when a MV needs to be refreshed. If the value returned by the query defined in the datagroup is different from the previous execution (by default the datagroup queries are run every five minutes), then any derived table that uses a datagroup _trigger and references that datagroup will be rebuilt.

Before you can use this technique to refresh an MV, however, you’ll first need to add a datagroup to the LookML model file that checks to see if new rows have been added to the product_reviews tables:


datagroup: product_reviews_datagroup {
  sql_trigger: select max(review_date) from product_reviews ;;
  max_cache_age: "1 hour"
}

Next, we'll create a dummy derived table that uses our new datagroup as a datagroup trigger, which means that each time the max (review_date) changes in the product_reviews table, the code associated with this derived table will be executed.


view: dummy {
  derived_table: {
    #sql: SELECT 1 as col1 ;;
    datagroup_trigger: product_reviews_datagroup
    create_process: {
    sql_step: REFRESH MATERIALIZED VIEW mvtest.mv_reviews_byprod_and_state ;;
    sql_step: DROP TABLE IF EXISTS DUMMY ;;
    sql_step: CREATE TABLE dummy as select 1 as col1 ;;
    }
  }
 
  measure: count {
    type: count
    drill_fields: [detail*]
  }
 
  dimension: col1 {
    type: number
    sql: ${TABLE}.col1 ;;
  }
 
  set: detail {
    fields: [col1]
  }
}

For this exercise, we aren’t really interested in the derived table that gets created, but instead care more about the Looker PDT syntax required for a table to be created. By using the LookML create_process syntax, we’re able to add a step to the PDT creation process that refreshes the MV. And because our derived table is associated with the product_reviews_datagroup, the MV will be updated every time the datagroup trigger detects new rows in the product_reviews table.

An important tool in the toolbox

Redshift’s MVs are an important tool in the Data Analyst’s toolbox because they provide better performing analytic queries, which in turn result in a better user experience. And with Looker, users can interact with Redshift MVs just like they would any other table, as well as automate the process of refreshing the MVs. With all of the benefits that MVs provide, perhaps it’s time to rethink some of the CTAS statements and complex aggregations in your current analytics environment and see where MVs may be able to both improve performance and reduce complexity.