BigQuery not taking advantage of WHERE predicate

morpheus · 10-16-2022 05:21 PM

We tried comparing the performance of BigQuery on the Star Schema Benchmark described here. The benchmark creates a table which is partitioned on Year and clustered on date.

When following query was run (Q1.3):

SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue

FROM lineorder_flat

WHERE (toISOWeek(LO_ORDERDATE) = 6) AND (toYear(LO_ORDERDATE) = 1994) AND ((LO_DISCOUNT >= 5) AND (LO_DISCOUNT <= 7)) AND ((LO_QUANTITY >= 26) AND (LO_QUANTITY <= 35))

we found BQ did a scan of all the rows in partition corresponding to the year 1994.

Expected: since the data is clustered (ordered) on the date, it should only have scanned a much smaller subset of the data. For comparison Clickhouse does just that - it only processes the rows corresponding to week 6 which is 50x less (there are 52 weeks in a year).

Conclusion: clustering the data has seemingly no benefit

Why is it like this? Can anyone explain this?

We saw same behavior for queries Q1.2, Q3.4. In those queries there is a predicate on the month clause in addition to the year. But BQ ends up processing the entire partition instead of just the rows in the month.

Any explanation and guidance appreciated. we are quite impressed with BQ's speed and massive parallelization.

RC1

@morpheus I think clustering benefits when your data size is greater than 1 GB. What is the size of the data in 1994 partition and what is the size of your overall data ?

morpheus

here are the details of the table we are querying:

the 1994 partition has 303M rows in it.