Hi, when we run the following queries in BigQuery, they both process the same amount of data (in MB), even though the first one queries a much shorter period of time:
The table is partitioned by month, on the `entered_at` column.
Why does the WHERE clause on `entered_at` have no effect on the number of MB processed? As a result, our costs are increasing rapidly.
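(The original queries weren't included in the post; as an illustration only, with a hypothetical table name and dates, they would have looked roughly like this:)

```sql
-- Illustration only: table name and dates are hypothetical;
-- only the `entered_at` column comes from the thread.

-- Query 1: short period
SELECT *
FROM `project.dataset.events`
WHERE entered_at > TIMESTAMP('2019-09-29');

-- Query 2: much longer period -- yet both report the same bytes processed
SELECT *
FROM `project.dataset.events`
WHERE entered_at > TIMESTAMP('2019-01-01');
```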
Based on some testing on a table partitioned by month, the whole month's data gets processed for any partial query.
E.g.
| Range (`entered_at` filter) | Data Processed | Rows returned |
| --- | --- | --- |
| > tomorrow | 383 KB | 0 |
| > today | 383 KB | 528 |
| > this month | 383 KB | 7,008 |
| > 31st Aug | 1.6 MB | 7,728 |
| > 1st Aug | 1.6 MB | 29,248 |
The bad news is that you'll be charged for scanning the whole month's data even if the query returns no rows.
Without knowing your use case it's difficult to know what to suggest, but since the data is partitioned by month is it workable to only run reports once a month for the previous month's data?
Changing to a daily partition scheme would save money if you have to query more often, if that's workable.
From a quick look at the best-practices document at https://cloud.google.com/bigquery/docs/best-practices-performance-input, it looks like you need to use the `_PARTITIONTIME` pseudocolumn to limit row scans on a partitioned table. Try replacing `entered_at` with `_PARTITIONTIME` in your queries.
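For example (a sketch only, with a hypothetical table name):

```sql
-- Restrict the scan to a single partition via the pseudocolumn.
-- Note: _PARTITIONTIME only exists on ingestion-time-partitioned tables.
SELECT *
FROM `project.dataset.events`
WHERE _PARTITIONTIME = TIMESTAMP('2019-09-01');
```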
Then I get the following error: Unrecognized name: _PARTITIONTIME
It looks like _PARTITIONTIME only exists for tables that are partitioned by ingestion time. Our table is partitioned by an existing column.
Hmm, yes that's only in tables partitioned by ingestion time as you say.
I tested on a small table partitioned by DAY(<timestamp column>) and it seems to work as expected, the processed data reduces as expected with a WHERE clause (even with SELECT *). Maybe partitioning by month doesn't work the same?
Does the Execution Details section have any clues? Look at the stage with the largest number of input rows and see what the READ section is doing (I suspect it's running a full table scan).
Also test with SELECT <columns> instead of SELECT * to see if that's causing it.
To keep costs down while troubleshooting you could create a temporary copy of the table with a limited set of rows and run your troubleshooting on that.
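Something along these lines (a sketch; the table names are assumed):

```sql
-- Create a small, cheap copy to experiment on.
-- Note: the LIMIT doesn't reduce the bytes billed for creating this copy,
-- but the resulting table will be small and cheap to query afterwards.
CREATE TABLE `project.dataset.events_sample` AS
SELECT *
FROM `project.dataset.events`
WHERE entered_at >= TIMESTAMP('2019-09-01')
LIMIT 10000;
```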
Testing with a table created partitioned by
Thanks for your help.
Changing the date doesn't help either. I've tried many formats. Even `WHERE entered_at > CURRENT_TIMESTAMP()` still processes the full table size.
Using SELECT <columns> has no effect either; it's still 300 MB for every query, regardless of the period.
These are the Execution Details (when using SELECT <columns>). I can't find any clues...
After a few days of headaches, we've decided to migrate the data to a new table, partitioned by day.
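For anyone hitting the same problem, the migration can be done with a single DDL statement along these lines (table names assumed):

```sql
-- Recreate the table partitioned by day on the existing column
CREATE TABLE `project.dataset.events_daily`
PARTITION BY DATE(entered_at)
AS
SELECT * FROM `project.dataset.events`;
```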
Thanks for your time Pete!