
WHERE on partitioned column has no effect

Hi, when we run the following queries in BigQuery, they both process the same number of MB, even though the first one queries a much shorter period of time:

(screenshot: the two queries and their bytes-processed figures)

The table is partitioned by month, on the `entered_at` column.

Why does the WHERE clause on `entered_at` have no effect on the number of MB processed? As a result, our costs are increasing rapidly.
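For context, the setup presumably looks something like this (the dataset/table names and exact DDL below are assumptions, not from the post):

```sql
-- Assumed shape of the table: partitioned by month on entered_at
CREATE TABLE my_dataset.events (
  entered_at TIMESTAMP,
  payload STRING
)
PARTITION BY TIMESTAMP_TRUNC(entered_at, MONTH);

-- Both of these report the same bytes processed,
-- even though the first covers a much shorter period:
SELECT * FROM my_dataset.events
WHERE entered_at > TIMESTAMP '2021-09-01 00:00:00';

SELECT * FROM my_dataset.events
WHERE entered_at > TIMESTAMP '2021-01-01 00:00:00';
```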

ACCEPTED SOLUTION

Based on some testing on a table partitioned by month, the whole month's data gets processed for any partial query.

E.g.

| Range | Data processed | Rows returned |
| --- | --- | --- |
| > tomorrow | 383KB | 0 |
| > today | 383KB | 528 |
| > this month | 383KB | 7,008 |
| > 31st Aug | 1.6MB | 7,728 |
| > 1st Aug | 1.6MB | 29,248 |

The bad news is that you'll get charged for scanning the whole month's data even if the query returns no rows.

Without knowing your use case it's difficult to know what to suggest, but since the data is partitioned by month is it workable to only run reports once a month for the previous month's data?

Changing to a daily partition scheme would save money if you have to query more often, if that's workable.


8 REPLIES

From a quick look in the best practice document at https://cloud.google.com/bigquery/docs/best-practices-performance-input it looks like you need to use the calculated _PARTITIONTIME column to limit row scans on a partitioned table. Try replacing entered_at with _PARTITIONTIME in your queries.
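Against an ingestion-time-partitioned table, that filter would look something like this (table name hypothetical):

```sql
-- _PARTITIONTIME is a pseudo-column available on
-- ingestion-time-partitioned tables
SELECT entered_at
FROM my_dataset.events
WHERE _PARTITIONTIME >= TIMESTAMP '2021-09-01';
```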

Then I get the following error: `Unrecognized name: _PARTITIONTIME`

It looks like _PARTITIONTIME only exists for tables that are partitioned by ingestion time. Our table is partitioned by an existing column.

Hmm, yes that's only in tables partitioned by ingestion time as you say.

I tested on a small table partitioned by DAY(<timestamp column>) and it works as expected: the processed data drops with a WHERE clause (even with SELECT *). Maybe partitioning by month doesn't work the same?
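Roughly the kind of test I ran, in current DDL syntax (names hypothetical):

```sql
-- Small test table, partitioned by day on the timestamp column
CREATE TABLE my_dataset.test_daily (ts TIMESTAMP, v INT64)
PARTITION BY DATE(ts);

-- With daily partitions, bytes processed shrinks as the range narrows:
SELECT * FROM my_dataset.test_daily
WHERE ts > TIMESTAMP '2021-09-01';
```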

Does the Execution Details section have any clues? Look at the stage with the largest number of input rows and see what the READ section is doing (I suspect it's running a full table scan).

Also test with SELECT <columns> instead of SELECT * to see if that's causing it.

To keep costs down while troubleshooting you could create a temporary copy of the table with a limited set of rows and run your troubleshooting on that.
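A sketch of such a scratch copy (names hypothetical; the one-off copy still scans the source table once):

```sql
-- Limited copy keeping the same partitioning scheme,
-- so troubleshooting queries stay cheap
CREATE TABLE my_dataset.events_sample
PARTITION BY TIMESTAMP_TRUNC(entered_at, MONTH) AS
SELECT * FROM my_dataset.events
LIMIT 10000;
```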

Testing with a table partitioned by `DATETIME_TRUNC(<timestamp column>, MONTH)` also works as expected for me.
I don't suppose the date format is somehow incorrect and being parsed as YYYY-DD-MM? What happens if you change the date in your first query to "2021-09-01 00:00:00"?

Thanks for your help.

Changing the date doesn't help either. I've tried many formats. Even `WHERE entered_at > CURRENT_TIMESTAMP()` still processes the full table size.

Using SELECT <columns> instead of SELECT * has no effect either: still 300MB for every query, regardless of the period.

These are the Execution Details (when using SELECT <columns>). I can't find any clues...

(screenshot: Execution Details)


After a few days of headache, we've decided to migrate the data to a new table which will be partitioned by day.
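For anyone landing here later, that migration can be sketched as (names hypothetical):

```sql
-- New table partitioned by day on the same column
CREATE TABLE my_dataset.events_daily
PARTITION BY DATE(entered_at) AS
SELECT * FROM my_dataset.events;
```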

Thanks for your time Pete!