
WHERE on partitioned column has no effect

Hi, when we run the following queries in BigQuery, they both process the same number of MB, even though the first one queries a much shorter period of time:

(screenshot: the two queries and their bytes-processed figures)

The table is partitioned by month, on the `entered_at` column.

Why does the WHERE clause on `entered_at` have no effect on the number of MB processed? As a result, our costs are increasing rapidly.
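For context, the setup presumably looks something like this (the dataset/table names and exact DDL below are assumptions, not from the post):

```sql
-- Assumed shape of the table: partitioned by month on entered_at
CREATE TABLE my_dataset.events (
  entered_at TIMESTAMP,
  payload STRING
)
PARTITION BY TIMESTAMP_TRUNC(entered_at, MONTH);

-- Both of these report the same bytes processed,
-- even though the first covers a much shorter period:
SELECT * FROM my_dataset.events
WHERE entered_at > TIMESTAMP '2021-09-01 00:00:00';

SELECT * FROM my_dataset.events
WHERE entered_at > TIMESTAMP '2021-01-01 00:00:00';
```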

ACCEPTED SOLUTION

Based on some testing on a table partitioned by month, the whole month's data gets processed for any partial query.

E.g.

| Range | Data processed | Rows returned |
| --- | --- | --- |
| > tomorrow | 383KB | 0 |
| > today | 383KB | 528 |
| > this month | 383KB | 7,008 |
| > 31st Aug | 1.6MB | 7,728 |
| > 1st Aug | 1.6MB | 29,248 |

The bad news is that you'll get charged for scanning the whole month's data even if the query returns no rows.

Without knowing your use case it's difficult to know what to suggest, but since the data is partitioned by month is it workable to only run reports once a month for the previous month's data?

Changing to a daily partition scheme would save money if you have to query more often, if that's workable.


8 REPLIES

From a quick look in the best practice document at https://cloud.google.com/bigquery/docs/best-practices-performance-input it looks like you need to use the calculated _PARTITIONTIME column to limit row scans on a partitioned table. Try replacing entered_at with _PARTITIONTIME in your queries.
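Against an ingestion-time-partitioned table, that filter would look something like this (table name hypothetical):

```sql
-- _PARTITIONTIME is a pseudo-column available on
-- ingestion-time-partitioned tables
SELECT entered_at
FROM my_dataset.events
WHERE _PARTITIONTIME >= TIMESTAMP '2021-09-01';
```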

Then I get the following error: `Unrecognized name: _PARTITIONTIME`

It looks like _PARTITIONTIME only exists for tables that are partitioned by ingestion time. Our table is partitioned by an existing column.

Hmm, yes that's only in tables partitioned by ingestion time as you say.

I tested on a small table partitioned by DAY(<timestamp column>) and it works as expected: the processed data drops with a WHERE clause (even with SELECT *). Maybe partitioning by month doesn't work the same?
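Roughly the kind of test I ran, in current DDL syntax (names hypothetical):

```sql
-- Small test table, partitioned by day on the timestamp column
CREATE TABLE my_dataset.test_daily (ts TIMESTAMP, v INT64)
PARTITION BY DATE(ts);

-- With daily partitions, bytes processed shrinks as the range narrows:
SELECT * FROM my_dataset.test_daily
WHERE ts > TIMESTAMP '2021-09-01';
```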

Does the Execution Details section have any clues? Look at the stage with the largest number of input rows and see what the READ section is doing (I suspect it's running a full table scan).

Also test with SELECT <columns> instead of SELECT * to see if that's causing it.

To keep costs down while troubleshooting you could create a temporary copy of the table with a limited set of rows and run your troubleshooting on that.
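A sketch of such a scratch copy (names hypothetical; the one-off copy still scans the source table once):

```sql
-- Limited copy keeping the same partitioning scheme,
-- so troubleshooting queries stay cheap
CREATE TABLE my_dataset.events_sample
PARTITION BY TIMESTAMP_TRUNC(entered_at, MONTH) AS
SELECT * FROM my_dataset.events
LIMIT 10000;
```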

Testing with a table partitioned by `DATETIME_TRUNC(<timestamp column>, MONTH)` also works as expected for me.
I don't suppose the date format is somehow incorrect and being parsed as YYYY-DD-MM? What happens if you change the date in your first query to "2021-09-01 00:00:00"?

Thanks for your help.

Changing the date doesn't help either. I've tried many formats. Even `WHERE entered_at > CURRENT_TIMESTAMP()` still processes the full table size.

Using SELECT <columns> instead of SELECT * has no effect either: still 300MB for every query, regardless of the period.

These are the Execution Details (when using SELECT <columns>). I can't find any clues...

(screenshot: Execution Details)


After a few days of headache, we've decided to migrate the data to a new table which will be partitioned by day.
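For anyone landing here later, that migration can be sketched as (names hypothetical):

```sql
-- New table partitioned by day on the same column
CREATE TABLE my_dataset.events_daily
PARTITION BY DATE(entered_at) AS
SELECT * FROM my_dataset.events;
```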

Thanks for your time Pete!