Combine "FULL OUTER JOIN ON FALSE" with "Introducing bind_all_filters"

Hi. I read the articles "Outer Join on False - or: How I learned to stop fanning-out and love the NULL" and "[Analytic Block] - Pivot by Top X - Introducing bind_all_filters: yes". First of all, thanks for the great articles.

I’m trying to make a model / Explore which:

  • Combines the four existing PDTs in the way the first article describes, so the data is not fanned out the way it would be by "normal" joins between the 4 tables
  • Uses non-persistent derived tables (probably native, so I can use bind_all_filters) that get pre-filtered before being merged into the final user-facing table (see the first sketch after this list). This is for performance: our source tables are > 25 TB each, and I’d rather join these tables after they’ve already been shrunk by filtering.
  • Creates a derived table that finds the time bins (the interval is defined by a Liquid parameter) common to the 4 tables. It does this by taking the filtered tables, keeping only the calculated time-bin column, and unioning the 4 sub-tables; the 4 tables are then merged back into this derived table with left joins (see the SQL sketch after this list). This effectively performs a fuzzy AND filter. I may give users the ability to turn it into an OR for certain searches.
  • Creates a final table that left joins the 4 filtered tables using the common time bins from the table above. It does this by including 2 columns in the common-time-bin table: the time-bin values common to all 4 tables, plus a column holding the name of each table (one of 4 string literals).
  • Returns results to the user without re-applying the user’s filters to the final table above. I think I can do this with a sql_always_where of “OR TRUE” (without parentheses) on the final table; I still have to test this.
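
To make the pre-filtering idea (second bullet) concrete, here’s a minimal LookML sketch of one of the 4 NDTs I have in mind. Everything in it is hypothetical — search_explore, events_a, the field names, and the BigQuery-style TIMESTAMP_TRUNC are placeholders, and the hard-coded HOUR would really come from a Liquid parameter:

```lookml
# Hypothetical pre-filtered NDT for one of the 4 source PDTs.
# search_explore, events_a, and all field names are placeholders.
view: events_a_filtered {
  derived_table: {
    explore_source: search_explore {
      bind_all_filters: yes    # inherit whatever filters the user set at runtime
      column: user_id    { field: events_a.user_id }
      column: created_at { field: events_a.created_at }
      derived_column: time_bin {
        # Bucket timestamps into intervals; the real bin size would come
        # from a Liquid parameter instead of a hard-coded HOUR.
        sql: TIMESTAMP_TRUNC(created_at, HOUR) ;;
      }
    }
  }
  dimension: user_id {}
  dimension: time_bin {
    type: date_time
    sql: ${TABLE}.time_bin ;;
  }
}
```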
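
And here’s a rough SQL sketch of the common-bin table and the final merge (third and fourth bullets), again with made-up names and only 2 of the 4 branches written out:

```sql
-- Time bins present in all 4 pre-filtered tables (the fuzzy AND).
-- All table and column names are placeholders.
WITH common_bins AS (
  SELECT time_bin
  FROM (
    SELECT DISTINCT time_bin FROM events_a_filtered
    UNION ALL SELECT DISTINCT time_bin FROM events_b_filtered
    UNION ALL SELECT DISTINCT time_bin FROM events_c_filtered
    UNION ALL SELECT DISTINCT time_bin FROM events_d_filtered
  ) all_bins
  GROUP BY time_bin
  HAVING COUNT(*) = 4   -- bin occurs in every table; >= 1 would give the OR variant
)
-- Left join each filtered table back onto the common bins, tagging each
-- branch with a string literal naming its source table.
SELECT b.time_bin, 'events_a' AS source_table, a.user_id
FROM common_bins b
LEFT JOIN events_a_filtered a ON a.time_bin = b.time_bin
UNION ALL
SELECT b.time_bin, 'events_b' AS source_table, e.user_id
FROM common_bins b
LEFT JOIN events_b_filtered e ON e.time_bin = b.time_bin
-- ...and likewise for events_c_filtered and events_d_filtered
```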

I’m pretty sure I can do something similar to the above using Liquid variables for every single filterable field (there are about three dozen of them), but I’d rather try doing it with bind_all_filters, if I can do it in a performance-friendly way.
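
For reference, the Liquid-variable route I’d be replacing looks roughly like this — one templated filter per filterable field, repeated ~36 times (all names made up):

```lookml
# Hypothetical hand-rolled alternative: a filter field plus a templated
# condition block for every filterable column.
view: events_a_filtered_liquid {
  derived_table: {
    sql:
      SELECT *
      FROM events_a
      WHERE {% condition user_id_filter %} user_id {% endcondition %}
        -- ...plus one more condition block per remaining filterable field
    ;;
  }
  filter: user_id_filter {
    type: string
  }
}
```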

I’m a bit confused by the bind_all_filters feature, though. The article states:

Note: A native derived table with bind_all_filters is required to be joined in the same explore it came from; i.e., explore = explore_source

This confuses me: at what “stage” in the source Explore does the NDT receive its input table? My Explore was going to include:

  1. the 4 filtered tables above,
  2. maybe the 4 underlying PDTs, if the NDTs above are required to use explore_source targets contained in the Explore (rather than arbitrary SQL tables), and
  3. the above table containing the timestamp bins common to all 4 filtered tables.

But if the [4+1] NDTs all get their input from the single source Explore, and those same tables are included in that Explore, this seems circular. Do the NDTs receive the “full” table produced after all the joining in the Explore, and then just pick the columns they need? If so, wouldn’t that cause a performance loss (tons of superfluous rows per NDT) by joining those tables into one massive table before my custom logic is applied?
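
To make the circular shape I’m describing concrete (all names hypothetical, trimmed to one NDT):

```lookml
# The NDT's explore_source points at the very Explore it is joined into.
explore: search_explore {
  view_name: events_a                  # base PDT
  join: events_a_filtered {
    sql_on: ${events_a.time_bin} = ${events_a_filtered.time_bin} ;;
    relationship: many_to_one
  }
}

view: events_a_filtered {
  derived_table: {
    explore_source: search_explore {   # the same Explore as above
      bind_all_filters: yes
      column: time_bin { field: events_a.time_bin }
    }
  }
  dimension: time_bin {}
}
```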

Questions:

  1. Can anybody clarify the order of operations when an Explore contains a view, and that view uses the same Explore for its explore_source? I’ve been meaning to inspect the SQL Looker generates for such an Explore, but I haven’t been able to dedicate enough time.
  2. Am I over-worrying about performing my join (after which I calculate time bins, group by them, and define the result as a temporary table) before filtering the rather large individual tables?
  3. Should I just stick with Liquid variables for this task, given that my main goal is performance? Or does anybody have ideas for building a high-performance version of this with bind_all_filters rather than Liquid variables?

Thanks so much!!

Sean
