Solved: Materialized View Odd Behavior - Page 2

noah_data_guy · 08-16-2023 09:13 AM

I have a source table that's pretty large. It's partitioned by day but has no cluster on it for now.

I have a bunch of materialized views that just select from this source table and filter on a specific region. The materialized view definition itself didn't have a partition filter, so any queries using the mvs needed to have this added to them.

I found that each of these materialized views, even when adding date filters, were still scanning the entire table since the rendered processed bytes for even selecting 1 day from the materialized view was as read-intensive as selecting from the entire base table.

Now, this source table only goes back to 2020-01-01 (that is the absolute min date on the table). I found that, oddly enough, if I added a filter to the materialized view directly by saying where date >= '2020-01-01', the read for the query was dropping hundreds if not terabytes of read even though this filter effectively does not exclude any data whatsoever since that is the minimum date anyways.

After implementing this, we're saving a ridiculous amount of read (and therefore money) for something that we can really not explain so well. Can anyone explain this?

ms4446

When you add a filter to a materialized view in BigQuery, the system uses this filter as a hint or context to optimize data scanning. This is achieved through techniques like predicate pushdown, where BigQuery tries to apply filters as early as possible in the data retrieval process.

The reason you observe performance improvements even with a seemingly broad filter is due to BigQuery's logical optimization techniques. By providing a filter in the materialized view definition, you give BigQuery a context about the data's structure, allowing it to make more efficient decisions about which partitions or clusters to scan.

In your case, without the filter, querying the materialized view might result in scanning a large portion of the data, say 450 GB. However, with the filter in place, even if it doesn't technically exclude much data, BigQuery can optimize the scan, leading to a much smaller amount of data being processed, like 3 GB. This optimization results in a significant performance boost.

View solution in original post