Conditions in Join Clauses (3.20+)

Release 3.20 (still weeks away when writing this) will come with the ability to add conditions in Join clauses.

The problem

Tables in many MPP database have a single sort key, that key is often time. Joining to these tables presents a challenge.

The common solution is to add a date condition in the join predicate. Until now that has been difficult in Looker.

For example, you might wish to join orders and emails for a given user over time and compute a conversion rate. Both orders and emails might be indexed on time and otherwise difficult or slow to join.

In SQL

You might write the following query

SELECT 
    users.id
    , COUNT(DISTINCT order.id)
FROM users
LEFT JOIN orders 
   ON orders.user_id=users.id 
      AND orders.created_time  BETWEEN '2015-01-01 00:00:00' AND '2015-01-31 00:00:00'
WHERE
  users.created_at BETWEEN '2015-01-01 00:00:00' AND '2015-01-07 00:00:00'
GROUP BY 1

In this case, we are looking at the number of orders that happened in january by the users created in the first week of january.

Since the orders table only has an index on time, we need to add a timeframe in order to avoid scanning the entire orders table. Adding:

 AND orders.created_time  BETWEEN '2015-01-01 00:00:00' AND '2015-01-31 00:00:00'

helps the query optimizer figure out how to pull out a subset of the data and join it.

In LookML

In release 3.20, these kinds of query can now be simply expressed in LookML.

The following example starts from the user base view then joins orders. In the explore, you must set two filters, the timeframe of the user creation and the order timeframe you wish to examine. In a non-MPP world, the SQL optimizer could probably use user_id as a key to limit scanning the orders table, but in an MPP world, you would need to set this manually.

- connection: red_look

- scoping: true           # for backward compatibility

- explore: users
  always_filter:
    users.created_date: 30 days
    orders.order_date_filter: 30 days
  joins:
  - join: orders
    relationship: one_to_many
    sql_on: |
       ${orders.user_id}=${users.id}
       AND {% condition orders.order_date_filter %} orders.created_at {% endcondition %}
      
- view: orders
  fields:
  - dimension: id
    primary_key: true
  
  - dimension: user_id
   
  - filter: order_date_filter
    type: date
  
  # You can still group by whatever time frame you like.
  - dimension_group: created
    type: time
    timeframes: [time, date, week, month]
    sql: ${TABLE}.created_at
  
  - measure: count
    type: count    
    
- view: users
  fields:
  - dimension: id
    primary_key: true
  
  - dimension_group: created
    type: time
    timeframes: [time, date, week, month]
    sql: ${TABLE}.created_at
  
  - measure: count
    type: count

Running the following query:

Yields the following SQL

SELECT 
	COUNT(*) AS "users.count",
	COUNT(DISTINCT orders.id) AS "orders.count"
FROM users
LEFT JOIN orders ON orders.user_id=users.id
AND (( orders.created_at ) >= (CONVERT_TIMEZONE('America/Los_Angeles', 'UTC', timestamp '2015-03-01')) AND ( orders.created_at ) < (CONVERT_TIMEZONE('America/Los_Angeles', 'UTC', timestamp '2015-04-01')))


WHERE 
	((users.created_at) >= (CONVERT_TIMEZONE('America/Los_Angeles', 'UTC', timestamp '2015-03-01')) AND (users.created_at) < (CONVERT_TIMEZONE('America/Los_Angeles', 'UTC', timestamp '2015-03-07')))
ORDER BY 2 DESC
LIMIT 500

This is just a simple example, there are many ways to use this feature to optimize queries.

15 10 2,788
10 REPLIES 10
Top Labels in this Space
Top Solution Authors