Symmetric Aggregates - Page 2

lloydtabb · 01-16-2015 04:38 PM

The Problem

A long standing limitation of SQL has been the asymmetry and complexity of its aggregate computations. In Looker, aggregate computations - like SUM(), AVERAGE() and COUNT() - are used to calculate measures.

SQL lets programmers join tables in arbitrarily complex ways, but when doing this, only a few of the aggregate computations work reliably.

The aggregate functions that work, regardless of how tables are joined, are called “symmetric aggregates.” They are:

COUNT(DISTINCT)
MAX()
MIN()

The functions that need to be used very carefully, to avoid incorrect results, are called “non-symmetric aggregates.” They are:

SUM()
AVERAGE()
COUNT()

The non-symmetric aggregates only work if two conditions are met:

The table that they are used on is in the FROM clause of the query, and not in a join
All the joins in the query have a many-to-one or one-to-one relationship

If either of these conditions are not met SUM(), AVERAGE() and COUNT() will be computed incorrectly. To see a simple example of this problem, consider the following tables:

orders

id	order_date	user_id	shipping_amount
1	2014-01-01	1	5.00
2	2014-01-02	2	6.00

order_items

id	order_id	sku	amount
1	1	1	23.00
2	1	2	12.00
3	2	2	12.00

Now assume we run the following query:

SELECT 
  SUM(order_items.amount) AS total_amount,
  SUM(orders.shipping_amount) AS total_shipping
FROM order_items
LEFT JOIN orders ON orders.id = order_items.order_id

In this case the relationship between order_items and orders is many-to-one (not one-to-many or one-to-one). Consequently, some of the results will be bad. To understand why this is the case, consider the result of the query before the SUM() is applied:

ORDER_ITEMS				ORDERS
id	order_id	sku	amount	id	order_date	user_id	shipping_amount
1	1	1	23.00	1	2014-01-01	1	5.00
2	1	2	12.00	1	2014-01-01	1	5.00
3	2	2	12.00	2	2014-01-02	2	6.00

Now it’s easier to see that, when orders.shipping_amount is summed, it will provide an incorrect result. Since order number 1 is repeated twice in the results, the shipping amount for it is added in twice. SUM(orders.shipping_amount) seems like it should work, but it doesn’t.

On the other hand, the symmetric aggregates do work reliably. If we swap out our SUM() for COUNT(DISTINCT) in our query it computes correctly:

SELECT 
  COUNT(DISTINCT order_items.id) as order_items_count
  , COUNT(DISTINCT orders.id) as orders_count
FROM order_items
LEFT JOIN orders ON orders.id = order_items.order_id

The worst part is, there’s no easy check to determine what the join relationships are, and therefore, if they’re going to cause problems. Although there are some tricks to determine the join relationship at the current time, it’s no guarantee that this will be the join relationship in the future. The only way to know for sure is to understand the underlying data and how it is being created. Brett Sauvé wrote a great post about this that explains how to work through all of this.

The Fix

LookML operates at an abstraction level above SQL (all LookML code is eventually converted to SQL). In LookML, you declare the join relationships (one-to-many, many-to-many, and one-to-one) as well as the join type (LEFT JOIN, FULL OUTER JOIN, etc). Additionally, LookML measures (which eventually become the aggregate computations we’ve been talking about) are declared in the view they are aggregating.

LookML measures are symmetric, no matter what. It doesn’t matter what the join structure is or if the table is in the FROM clause. This simplification makes asking questions much more direct, and takes away the complexity involved with certain queries. As long as you define your joins correctly, errors in calculations are eliminated.

Video Store Example

MySQL’s sample video store database (yes, a video store, shows you how old it is) actually presents some interesting analytical challenges. They can demonstrate the power of Looker’s aggregate computation abilities.

In the example database, copies of a film are purchased and put in inventory. At some later date a customer rents a copy from inventory, which is recorded as a rental. When the rental is returned the customer makes a payment. Films have one or more categories, which are looked up in film_category, and stored in category. Similarly, films have one or more actors, which are looked up in film_actor, and stored in actor.

Visually, all the join relationships look like this:

film_category-----film-----film_actor
      |            |           |
      |            |           |
   category    inventory     actor
                   |
                   |
                 rental
                  / \
                 /   \
         customer-----payment

In prior versions of LookML it was important to try and avoid one-to-many joins. One of the paths we might have used to join as many things together as possible, while avoiding one-to-many joins, would be to start with payment and join as follows:

payment
↖ rental
  ↖ customer
  ↖ inventory
    ↖ film

As just one example of the complicated considerations involved: note that we’re joining customer to payment through the rental, instead of directly, because if we did go directly, it would be a one-to-many join.

These types of things would be easy to compute when exploring from payment:

Payments by month, week, day of week
Top performing films
Top customers

These things would not be so easy to compute:

Inventory items that haven’t been rented
Duration to pay back an inventory item once its been rented
Revenue of rentals by category or actor

Joining the New Way

In our new model, we’ll start exploring from inventory instead of payment. The joins will look like:

inventory
↖ film
  ↖ film_category
    ↖ category
  ↖ film_actor
    ↖ actor
  ↖ rental
    ↖ customer
    ↖ payment

In previous versions of Looker, this would have been a bad thing to do. For example, inventory to film_category is a one-to-many relationship. Now, however, this is perfectly fine.

This is how the LookML is written:

- connection: video_store
- scoping: true
- include: "sakila.view.lookml"

- explore: inventory
  joins:
    - join: film
      foreign_key: inventory.film_id
  
    # Fanout!
    - join: film_category
      sql_on: ${film.film_id} = ${film_category.film_id}
      relationship: one_to_many

    - join: category
      foreign_key: film_category.category_id

    # Fanout!
    - join: film_actor
      sql_on: ${film.film_id} = ${film_actor.film_id}
      relationship: one_to_many

    - join: actor
      foreign_key: film_actor.actor_id
      
    - join: rental
      sql_on: ${inventory.inventory_id} = ${rental.inventory_id}
      relationship: one_to_many

    - join: customer
      foreign_key: rental.customer_id
      
    - join: payment
      sql_on: ${payment.rental_id} = ${rental.rental_id}
      relationship: one_to_many

Add a Few Measures (a.k.a. Aggregate Computations)

To the payment view, we’ll add the measure:

- measure: total_amount
  type: sum
  value_format_name: decimal_2
  sql: ${amount}

And to the inventory view, we’ll add this measure. The replacement cost isn’t an attribute of inventory, but rather an attribute of film.

- measure: total_replacement_cost
  type: sum
  value_format_name: decimal_2
  sql: ${film.replacement_cost}

A Simple Business Overview: Computing Return on Investment

Looking at costs vs. revenue is the key to any business. Costs in this case come in the form of inventory. Revenue comes in the form of payments on rentals. Using the measures we just created, we can compute these sums. From the data below, we can see that we’ve spent about $93K on inventory, and over time, have been paid back $67K.

Explore Data

In raw SQL it wouldn’t be possible to compute a sum in both the inventory view and payment view at the same time. It would require two separate queries (or sub queries). This would add complexity and we wouldn’t have the ability to drill.

Build a Compound Measure: Percent Payback

We can combine the two measures we just created to see how close we are to breaking even:

- measure: percent_payback
  type: number
  value_format_name: decimal_2
  sql: 100.0 * ${payment.total_amount} / ${total_replacement_cost}

How Long Does It Take to be Paid Back?

To get an idea of how payback changes over time, we can group by month, and display our new Percent Payback measure. It looks like payback is about 4 months.

Explore Data

Does Payback Vary by Category?

If we limit ourselves to May 2005, where we achieved payback overall, we can determine which categories do the best:

Explore Data

Top 10 Selling Actors and Their Genres

We can determine which actors returned the most money to us, and from which categories the money is coming (we’ve limited to the top 3 categories just to make the table easier to read) .

Explore Data

Note: It’s still important to think about what this data is telling us. Since each inventory item can have multiple categories and actors, the money from a single payment might show up more than once, for each actor and category associated with that payment. If you think about it, this makes sense. If we wanted to avoid this repeated data, we’d somehow have to split each payment amongst all the different categories and actors associated with it. We haven’t explained to Looker how it should do that.

So, even though we’ve avoided the classic SQL problem where certain types of joins create bad sums, you can still segment the data such that information is repeated. In this example, the data is still meaningful and correct in some sense. However, you wouldn’t be able to add up these numbers and arrive at the total amount of money the video store has made.

Try It at Home

The above computations would be very difficult to write in standard SQL. LookML makes the difficult easy.