A long standing limitation of SQL has been the asymmetry and complexity of its aggregate computations. In Looker, aggregate computations - like SUM()
, AVERAGE()
and COUNT()
- are used to calculate measures.
SQL lets programmers join tables in arbitrarily complex ways, but when doing this, only a few of the aggregate computations work reliably.
The aggregate functions that work, regardless of how tables are joined, are called “symmetric aggregates.” They are:
COUNT(DISTINCT)
MAX()
MIN()
The functions that need to be used very carefully, to avoid incorrect results, are called “non-symmetric aggregates.” They are:
SUM()
AVERAGE()
COUNT()
The non-symmetric aggregates only work if two conditions are met:
FROM
clause of the query, and not in a joinIf either of these conditions are not met SUM()
, AVERAGE()
and COUNT()
will be computed incorrectly. To see a simple example of this problem, consider the following tables:
orders
id | order_date | user_id | shipping_amount |
---|---|---|---|
1 | 2014-01-01 | 1 | 5.00 |
2 | 2014-01-02 | 2 | 6.00 |
order_items
id | order_id | sku | amount |
---|---|---|---|
1 | 1 | 1 | 23.00 |
2 | 1 | 2 | 12.00 |
3 | 2 | 2 | 12.00 |
Now assume we run the following query:
SELECT
SUM(order_items.amount) AS total_amount,
SUM(orders.shipping_amount) AS total_shipping
FROM order_items
LEFT JOIN orders ON orders.id = order_items.order_id
In this case the relationship between order_items and orders is many-to-one (not one-to-many or one-to-one). Consequently, some of the results will be bad. To understand why this is the case, consider the result of the query before the SUM()
is applied:
ORDER_ITEMS | ORDERS | ||||||
---|---|---|---|---|---|---|---|
id | order_id | sku | amount | id | order_date | user_id | shipping_amount |
1 | 1 | 1 | 23.00 | 1 | 2014-01-01 | 1 | 5.00 |
2 | 1 | 2 | 12.00 | 1 | 2014-01-01 | 1 | 5.00 |
3 | 2 | 2 | 12.00 | 2 | 2014-01-02 | 2 | 6.00 |
Now it’s easier to see that, when orders.shipping_amount is summed, it will provide an incorrect result. Since order number 1 is repeated twice in the results, the shipping amount for it is added in twice. SUM(orders.shipping_amount)
seems like it should work, but it doesn’t.
On the other hand, the symmetric aggregates do work reliably. If we swap out our SUM()
for COUNT(DISTINCT)
in our query it computes correctly:
SELECT
COUNT(DISTINCT order_items.id) as order_items_count
, COUNT(DISTINCT orders.id) as orders_count
FROM order_items
LEFT JOIN orders ON orders.id = order_items.order_id
The worst part is, there’s no easy check to determine what the join relationships are, and therefore, if they’re going to cause problems. Although there are some tricks to determine the join relationship at the current time, it’s no guarantee that this will be the join relationship in the future. The only way to know for sure is to understand the underlying data and how it is being created. Brett Sauvé wrote a great post about this that explains how to work through all of this.
LookML operates at an abstraction level above SQL (all LookML code is eventually converted to SQL). In LookML, you declare the join relationships (one-to-many, many-to-many, and one-to-one) as well as the join type (LEFT JOIN
, FULL OUTER JOIN
, etc). Additionally, LookML measures (which eventually become the aggregate computations we’ve been talking about) are declared in the view they are aggregating.
LookML measures are symmetric, no matter what. It doesn’t matter what the join structure is or if the table is in the FROM
clause. This simplification makes asking questions much more direct, and takes away the complexity involved with certain queries. As long as you define your joins correctly, errors in calculations are eliminated.
MySQL’s sample video store database (yes, a video store, shows you how old it is) actually presents some interesting analytical challenges. They can demonstrate the power of Looker’s aggregate computation abilities.
In the example database, copies of a film are purchased and put in inventory. At some later date a customer rents a copy from inventory, which is recorded as a rental. When the rental is returned the customer makes a payment. Films have one or more categories, which are looked up in film_category, and stored in category. Similarly, films have one or more actors, which are looked up in film_actor, and stored in actor.
Visually, all the join relationships look like this:
film_category-----film-----film_actor
| | |
| | |
category inventory actor
|
|
rental
/ \
/ \
customer-----payment
In prior versions of LookML it was important to try and avoid one-to-many joins. One of the paths we might have used to join as many things together as possible, while avoiding one-to-many joins, would be to start with payment and join as follows:
payment
↖ rental
↖ customer
↖ inventory
↖ film
As just one example of the complicated considerations involved: note that we’re joining customer to payment through the rental, instead of directly, because if we did go directly, it would be a one-to-many join.
These types of things would be easy to compute when exploring from payment:
These things would not be so easy to compute:
In our new model, we’ll start exploring from inventory instead of payment. The joins will look like:
inventory
↖ film
↖ film_category
↖ category
↖ film_actor
↖ actor
↖ rental
↖ customer
↖ payment
In previous versions of Looker, this would have been a bad thing to do. For example, inventory to film_category is a one-to-many relationship. Now, however, this is perfectly fine.
This is how the LookML is written:
- connection: video_store
- scoping: true
- include: "sakila.view.lookml"
- explore: inventory
joins:
- join: film
foreign_key: inventory.film_id
# Fanout!
- join: film_category
sql_on: ${film.film_id} = ${film_category.film_id}
relationship: one_to_many
- join: category
foreign_key: film_category.category_id
# Fanout!
- join: film_actor
sql_on: ${film.film_id} = ${film_actor.film_id}
relationship: one_to_many
- join: actor
foreign_key: film_actor.actor_id
- join: rental
sql_on: ${inventory.inventory_id} = ${rental.inventory_id}
relationship: one_to_many
- join: customer
foreign_key: rental.customer_id
- join: payment
sql_on: ${payment.rental_id} = ${rental.rental_id}
relationship: one_to_many
To the payment view, we’ll add the measure:
- measure: total_amount
type: sum
value_format_name: decimal_2
sql: ${amount}
And to the inventory view, we’ll add this measure. The replacement cost isn’t an attribute of inventory, but rather an attribute of film.
- measure: total_replacement_cost
type: sum
value_format_name: decimal_2
sql: ${film.replacement_cost}
Looking at costs vs. revenue is the key to any business. Costs in this case come in the form of inventory. Revenue comes in the form of payments on rentals. Using the measures we just created, we can compute these sums. From the data below, we can see that we’ve spent about $93K on inventory, and over time, have been paid back $67K.
In raw SQL it wouldn’t be possible to compute a sum in both the inventory view and payment view at the same time. It would require two separate queries (or sub queries). This would add complexity and we wouldn’t have the ability to drill.
We can combine the two measures we just created to see how close we are to breaking even:
- measure: percent_payback
type: number
value_format_name: decimal_2
sql: 100.0 * ${payment.total_amount} / ${total_replacement_cost}
To get an idea of how payback changes over time, we can group by month, and display our new Percent Payback measure. It looks like payback is about 4 months.
If we limit ourselves to May 2005, where we achieved payback overall, we can determine which categories do the best:
We can determine which actors returned the most money to us, and from which categories the money is coming (we’ve limited to the top 3 categories just to make the table easier to read) .
Note: It’s still important to think about what this data is telling us. Since each inventory item can have multiple categories and actors, the money from a single payment might show up more than once, for each actor and category associated with that payment. If you think about it, this makes sense. If we wanted to avoid this repeated data, we’d somehow have to split each payment amongst all the different categories and actors associated with it. We haven’t explained to Looker how it should do that.
So, even though we’ve avoided the classic SQL problem where certain types of joins create bad sums, you can still segment the data such that information is repeated. In this example, the data is still meaningful and correct in some sense. However, you wouldn’t be able to add up these numbers and arrive at the total amount of money the video store has made.
The above computations would be very difficult to write in standard SQL. LookML makes the difficult easy.