All too often, we limit ourselves needlessly when we equate measures, metrics, or aggregates with the few basic aggregates that SQL provides out of the box: SUM, COUNT, AVERAGE, maybe MEDIAN if we’re feeling creative and our dialect supports it.
For a moment, though, let’s step back and think about what measures are essentially for: taking a potentially large number of underlying data points, usually within a group/dimension, and condensing them down into a digestible datum. Thousands of orders? Total revenue. Millions of sessions? Average time on site. Hundreds of users per segment? A histogram of visit frequency. A dozen campaigns with hundreds of dates and thousands of visits? A sparkline of visits over the last 30 days.
Ok, some of the above are almost over-the-top, but the point is - your imagination is the limit. Here’s a simple way to use measures when summarizing a set of data points in which only the latest datapoint should be used. Examples are inventory snapshotting, balance sheets, historical state of attributes given a timestamped list of changes, etc.
Let’s say you have inventory snapshots that you take for each piece of inventory at the end of each day:
INVENTORY_SNAPSHOT
 inventory_id |  taken_at  | amount
--------------+------------+--------
            1 | 2018-01-01 |     17
            2 | 2018-01-01 |      9
            3 | 2018-01-01 |     29
            1 | 2018-01-02 |     13
            2 | 2018-01-02 |     37
            3 | 2018-01-02 |     19
 ...
For any given window(s) of time, we want to aggregate all the rows for a given inventory ID by selecting the latest one. There are different ways to do this, but I like this one because it’s short, reusable, and runs in O(n) time with O(1) memory per group, as opposed to approaches based on sorting/ordering, which are typically O(n log n):
SELECT
  inventory_id,
  SUBSTRING(MAX( taken_at::varchar(10) || amount ),11)::int AS ending_amount
  ...
GROUP BY 1
-- Written for Redshift. The :: operator is type casting, || is string concatenation
Let’s unpack that a bit: for all the rows matching the filter within a given inventory ID group, we first construct a concatenated field from a fixed-length string version of the sorting field (taken_at) followed by the field we care about (amount). Applying the MAX aggregate to this calculated field then picks out the latest row for each inventory ID, and SUBSTRING discards the date prefix, leaving just the value.
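To make that concrete, here is a hedged sketch of the full query shape against the sample table above (the date window in the WHERE clause is purely illustrative):

-- Worked example for inventory_id = 1 from the sample table:
--   '2018-01-01' || 17  ->  '2018-01-0117'
--   '2018-01-02' || 13  ->  '2018-01-0213'
-- MAX compares the fixed-length 10-character date prefix first, so it returns
-- '2018-01-0213'; SUBSTRING(..., 11) then strips the date, leaving 13.
SELECT
  inventory_id,
  SUBSTRING(MAX( taken_at::varchar(10) || amount ),11)::int AS ending_amount
FROM inventory_snapshot
WHERE taken_at BETWEEN '2018-01-01' AND '2018-01-02'  -- illustrative window
GROUP BY 1
-- Result on the sample data: (1, 13), (2, 37), (3, 19)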
All well and good, but how do we translate this to LookML? You’ll notice that the “latest” aggregate is tightly bound to one grouping (inventory ID in our example), so it’s not a great candidate to be a measure directly in our view, where users could pair it with arbitrary dimensions. Instead, it makes sense to put that aggregation and grouping inside a derived table where the pairing will be hard-coded.
On the other hand, the aggregate is loosely bound to the time grouping, since we want users to be able to swap in different levels of granularity, or even no time grouping at all for just the absolute latest data. In addition, this inventory-by-time aggregate often needs to be further aggregated up to other, uncoupled levels of grouping. With all that in mind, here is one example of how to tie these things together, using our _in_query feature to write some “dynamic” SQL:
explore: inventory {
  join: inventory_snapshot_dynamic {
    type: left_outer
    relationship: one_to_many
    sql_on: ${inventory_snapshot_dynamic.inventory_id} = ${inventory.id} ;;
  }
}
view: inventory_snapshot_dynamic {
  derived_table: {
    sql:
      SELECT inventory_id,
        {% if inventory_snapshot_dynamic.dynamic_granularity_date._in_query %}
          DATE_TRUNC('day',taken_at)
        {% elsif inventory_snapshot_dynamic.dynamic_granularity_week._in_query %}
          DATE_TRUNC('week',taken_at)
        {% elsif inventory_snapshot_dynamic.dynamic_granularity_month._in_query %}
          DATE_TRUNC('month',taken_at)
        {% elsif inventory_snapshot_dynamic.dynamic_granularity_quarter._in_query %}
          DATE_TRUNC('quarter',taken_at)
        {% elsif inventory_snapshot_dynamic.dynamic_granularity_year._in_query %}
          DATE_TRUNC('year',taken_at)
        {% else %} 0
        {% endif %} as dynamic_granularity
        --------
        {% if inventory_snapshot_dynamic.sum_ending_amount._in_query %}
          , SUBSTRING(MAX( taken_at::varchar(10) || amount ),11)::int as ending_amount
        {% endif %}
      FROM inventory_snapshot
      WHERE {% condition inventory_snapshot_dynamic.dynamic_granularity_date %} taken_at {% endcondition %}
        AND {% condition inventory_snapshot_dynamic.dynamic_granularity_week %} taken_at {% endcondition %}
        AND {% condition inventory_snapshot_dynamic.dynamic_granularity_month %} taken_at {% endcondition %}
        AND {% condition inventory_snapshot_dynamic.dynamic_granularity_quarter %} taken_at {% endcondition %}
        AND {% condition inventory_snapshot_dynamic.dynamic_granularity_year %} taken_at {% endcondition %}
      GROUP BY 1,2
    ;;
  }
  dimension: inventory_id {}
  dimension_group: dynamic_granularity {
    group_label: "History Granularity"
    type: time
    datatype: date
    sql: ${TABLE}.dynamic_granularity ;;
    timeframes: [year, quarter, month, week, date]
  }
  measure: sum_ending_amount {
    type: sum
    sql: ${TABLE}.ending_amount ;;
  }
}
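To see what this produces, here is roughly how the derived table renders when a user selects the month timeframe plus the ending-amount measure, with no date filter applied (a simplified sketch; the unapplied {% condition %} blocks render as always-true predicates):

-- Approximate rendering of the derived table for a month-granularity query
SELECT inventory_id,
  DATE_TRUNC('month',taken_at) as dynamic_granularity,
  SUBSTRING(MAX( taken_at::varchar(10) || amount ),11)::int as ending_amount
FROM inventory_snapshot
GROUP BY 1,2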
The above works well enough in isolation, but when you want this and other similar dynamic views to be composable into one explore, it helps to isolate and externalize the date fields so multiple views can reference them. Here’s an illustrative partial example:
explore: inventory {
  join: dynamic_granularity { sql: ;; relationship: one_to_one } # Field-only view
  join: inventory_snapshot_dynamic {
    type: left_outer
    sql_on: ${inventory_snapshot_dynamic.inventory_id} = ${inventory.id} ;;
  }
  join: sales_dynamic {
    type: left_outer
    sql_on: ${sales_dynamic.inventory_id} = ${inventory.id}
      {% if inventory_snapshot_dynamic._in_query %}
        AND sales_dynamic.dynamic_granularity = inventory_snapshot_dynamic.dynamic_granularity
      {% endif %} ;;
  }
}
view: dynamic_granularity {
  label: "[History]"
  dimension_group: dynamic_granularity {
    group_label: "Granularity"
    type: time
    datatype: date
    sql: COALESCE(
      {% if inventory_snapshot_dynamic._in_query %}
        inventory_snapshot_dynamic.taken_at,
      {% endif %}
      {% if sales_dynamic._in_query %}
        sales_dynamic.sale_at,
      {% endif %}
      NULL ) ;;
    timeframes: [year, quarter, month, date]
  }
}
view: inventory_snapshot_dynamic {
  derived_table: {
    sql:
      SELECT inventory_id,
        {% if dynamic_granularity.dynamic_granularity_date._in_query %}
          DATE_TRUNC('day',taken_at)
        {% elsif dynamic_granularity.dynamic_granularity_week._in_query %}
          DATE_TRUNC('week',taken_at)
        ...
Fabio,
Thanks for taking the time to post this. Very helpful.
I tried a somewhat similar approach to this same problem and came across an edge case (described below) that didn’t work. I’d like to know if this case would work using your approach but it’s hard for me to tell just looking at the code. The case goes like this.
Test Case
User selects date = 'last 7 days' in the filter, but not in the report. In this case the report incorrectly adds up the seven days.
If user adds date to the report then the report works as it should.
Have you encountered this case? Are you able to verify if it works or not under this approach? Would very much appreciate any help you can provide.
Hi, good catch! I believe this should be addressable:

- Use field._is_selected rather than field._in_query in the subquery’s SELECT clause, so the granularity logic only reacts to fields actually shown in the report, not fields merely referenced by a filter.
- Add a filter field that the user should use for date filtering (and steer them toward it with always_filter).
- The sql of a filter can be used to override the SQL that gets put into the outer WHERE clause. We’ll use sql: TRUE ;; since we don’t want to filter in the outer query, where the dates may have already been transformed.
- In the derived table, apply the filter to the raw date column: WHERE {% condition new_filter_field %} date_column_to_filter {% endcondition %}
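Putting those pieces together, a rough sketch (only the month granularity is shown, and snapshot_date_filter is a placeholder name):

view: inventory_snapshot_dynamic {
  filter: snapshot_date_filter {
    # User-facing date filter; sql: TRUE keeps it from filtering the outer query,
    # where the dates may already have been transformed
    type: date
    sql: TRUE ;;
  }
  derived_table: {
    sql:
      SELECT inventory_id,
        {% if inventory_snapshot_dynamic.dynamic_granularity_month._is_selected %}
          DATE_TRUNC('month',taken_at)
        {% else %} 0
        {% endif %} as dynamic_granularity
        , SUBSTRING(MAX( taken_at::varchar(10) || amount ),11)::int as ending_amount
      FROM inventory_snapshot
      -- apply the user's date filter to the raw date column inside the subquery
      WHERE {% condition inventory_snapshot_dynamic.snapshot_date_filter %} taken_at {% endcondition %}
      GROUP BY 1,2 ;;
  }
  ...
}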
I don’t have a test dataset available just now to test the above suggestion, but believe it should work as intended. Let me know if you run into any issues!
Fabio,
I very much appreciated your quick response. Thank you.
I’ll try to take this approach and adapt it to my situation. Thanks for the tip again.
Juan
Is this still the best design pattern for handling semi-additive facts? I’m having trouble getting this to work as expected.
I’m trying to generalize this against a date dimension instead of for each individual fact.
Hi David, long time no talk!
As far as the measure itself, yes, I would say that using this CONCAT+MAX approach is still what I would recommend. I have found it convenient to write some UDFs to make the expression more usable, but that depends on what SQL engine/dialect you’re working with.
As far as making date logic generalized, one approach that has worked really well for me is specifying the date fields as such:
view: date {
  view_label: "[Date]"
  filter: view_description {
    label: "(info) -➤"
    description: "This is a co-dimension view. These fields combine the primary date dimensions from one or more tables (e.g., views, orders, registrations, etc.)"
    sql: TRUE /* This field is just a description, not useful in a query */ ;;
  }
  filter: date_filter {
    # Useful if your tables tend to have date-based sorting/clustering/partitioning, for use with always_filter
    type: date_time
    datatype: timestamp
    label: "Date Limit"
    sql: COALESCE({% condition %} ${TABLE} {% endcondition %}, TRUE) ;; # TRUE if null, i.e. applied to a row with no date column
  }
  dimension_group: _ {
    type: time
    datatype: timestamp
    sql: ${TABLE} ;; #### Notice we just reference the table alias directly ####
    timeframes: [week, date, raw]
  }
  # Any other date-based fields
}
And then specify the relevant tables to apply the date fields to at the explore level, like so, after having joined in the relevant tables on separate rows (e.g. using an outer join on FALSE, or using “join paths”):
explore: foo {
  ...
  join: date {
    type: cross
    relationship: one_to_one
    sql_table_name: UNNEST([COALESCE(
      {% if views._in_query %} views.created_at, {% endif %}
      {% if orders._in_query %} orders.ordered_at, {% endif %}
      {% if registrations._in_query %} registrations.created_at, {% endif %}
      CAST(NULL AS TIMESTAMP)
    )]) ;;
  }
}
The above is BigQuery syntax, but any engine with a lateral join or cross apply syntax should work as well.
By specifying the tables in the join at the explore level, you can avoid creating a large number of views that need to be maintained just for the one explore. And by having a convention across your project to always use the “date” alias, you can re-use date logic pretty conveniently.
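If you also want to steer users toward the date_filter field above, here is a hedged sketch of wiring it in via always_filter (the “last 90 days” default is just an assumption):

explore: foo {
  always_filter: {
    filters: [date.date_filter: "last 90 days"]  # illustrative default value
  }
  ...
}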
How does this deal with the additive part of semi-additive?
I’d like my measures to aggregate on all dimensions except for time. Is that where you’ve used UDFs to work that out? This gets problematic with nested aggregation otherwise.
The additive part would still be in a derived table as suggested by the article. Really the only new thing I’ve suggested in my latest comment is a slightly more re-usable approach to declaring and joining in the field-only view, so that you can depend on it always having the same name.
But perhaps I’m not understanding the hurdle you’re trying to overcome; do you have an example you could share with me?
Hi,
Now I understand; I had misinterpreted the value.
Thanks
In case you have sparse inventory/snapshot tables (that is, rows are only present for a given inventory/account on days when the value changed, rather than on every day), you can also incorporate a window function to pull through the latest value into missing date periods.
For example, you can take the latest value present within each account+date grouping:
, SUBSTRING(MAX( taken_at::varchar(10) || amount ),11)::int as ending_amount
... and wrap it in a window function, such as
LAST_VALUE( {{above aggregate expression}} IGNORE NULLS) OVER (PARTITION BY account_id ORDER BY date_field ASC)
However, because this will only pull values into rows which exist in the result set, you will want to first apply a cross join so that every account+date pair is in your intermediate result set. This join should work from a dynamically written date table so that you don't perform an expensive cross join against every possible date value, but rather only the date granularity and date range in the user's query - for example, a user may have queried for each month in the trailing 12 months (12 rows output per account in the intermediate result set), rather than every day since 1970 (thousands of rows output per account). This dynamic date query can start from a full date table and simply apply the conditions and grouping suggested by the query to output a few specific dates to fill in the next step of the query.
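For illustration, a hedged sketch of that cross-join-then-fill pattern in Redshift-style SQL (the dates calendar table and the account_snapshot / account_id / amount names are assumptions):

WITH per_period AS (
  SELECT
    a.account_id,
    d.month_start AS period,
    -- latest value present within each account + month; NULL when no snapshot row exists
    SUBSTRING(MAX( s.taken_at::varchar(10) || s.amount ),11)::int AS ending_amount
  FROM (SELECT DISTINCT account_id FROM account_snapshot) a
  CROSS JOIN (
    SELECT DISTINCT DATE_TRUNC('month', calendar_date) AS month_start
    FROM dates
    WHERE calendar_date BETWEEN '2018-01-01' AND '2018-12-31'  -- illustrative range
  ) d
  LEFT JOIN account_snapshot s
    ON s.account_id = a.account_id
   AND DATE_TRUNC('month', s.taken_at) = d.month_start
  GROUP BY 1, 2
)
SELECT
  account_id,
  period,
  -- carry the most recent non-null value forward into months with no snapshot rows
  LAST_VALUE(ending_amount IGNORE NULLS) OVER (
    PARTITION BY account_id
    ORDER BY period
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS ending_amount_filled
FROM per_period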
This is not an analytical solution. Can you please share a LookML function or calculation which can do this in one line, instead of a pre-calculation or derived table?
It is very surprising that LookML has no direct function for this, given how it is otherwise publicized in this blog.
Are you looking to avoid using a derived table? Although you can use this type of measure directly without hard-pairing it to a specific dimension, I think in most cases this would primarily result in confusion for users who are not intimately familiar with the data model.
Yes, I do not want to use any derived table or pre-calculation of any kind; I want to calculate it in the model, as in any other analytics tool, not ahead of time.
Can you please share the LookML function for it?
Users will not be exposed to the data model, so there will be no confusion.
This is what it looks like in our existing world, Power BI / DAX:
CALCULATE(
  SUM('ARR'[arr_amt_usd]),
  FILTER(ALL('Calendar'[Date]), 'Calendar'[Date] = MAX('ARR'[ReportDate]))
)
LookML derived_tables can be either "persistent" or "ephemeral". In the latter case, it becomes a subquery/CTE in SQL, so it would be a part of the user's query and not "pre calculated" in any way. Although the declaration is longer than in DAX, there is a lot of flexibility there too. Are there any other requirements you are trying to solve for that an ephemeral derived table would not handle?
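To illustrate that point, here is roughly the shape of the single query such a request would generate; nothing is computed ahead of time, and the inventory.category dimension is just a placeholder:

WITH inventory_snapshot_dynamic AS (
  SELECT inventory_id,
    DATE_TRUNC('month',taken_at) AS dynamic_granularity,
    SUBSTRING(MAX( taken_at::varchar(10) || amount ),11)::int AS ending_amount
  FROM inventory_snapshot
  GROUP BY 1, 2
)
SELECT
  inventory.category,  -- placeholder dimension
  SUM(inventory_snapshot_dynamic.ending_amount) AS sum_ending_amount
FROM inventory
LEFT JOIN inventory_snapshot_dynamic
  ON inventory_snapshot_dynamic.inventory_id = inventory.id
GROUP BY 1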
The CTE/subquery is not an efficient way for us: we have 15 dimensions apart from time, so the SQL gets very long, and the data size is 250 GB in the fact view. The semi-additive measure has to be freely sliceable across any dimension. Why does another CTE/SQL have to be written on top of the fact? It is pre-computation. Can’t the same be achieved through a simple LookML function? The user has no access to the model.