Hi, I have a question about the number of bytes processed when joining another table on a clustered column. I've noticed that BigQuery still does a full table scan, which I wasn't expecting. If I filter against a literal value, it works as expected.
I've noticed this because of the high cost of doing deletes based on a join, e.g.
DELETE foo.bar WHERE EXISTS (SELECT 1 FROM cte WHERE bar.id = cte.id)
does a full table scan, even if `id` is a clustered column.
But the same applies to SELECT queries.
Example:
CREATE SCHEMA IF NOT EXISTS `foo`;
CREATE TABLE IF NOT EXISTS
`foo.bar` ( id STRING )
CLUSTER BY
id;
INSERT INTO
`foo.bar` (id)
SELECT
GENERATE_UUID() AS id
FROM
UNNEST(GENERATE_ARRAY(1, 1024 * 1024 * 10 / 36)); -- each UUID is 36 bytes, so this inserts roughly 10MB of data
INSERT INTO
`foo.bar` (id)
VALUES
('my_id');
-- bytes processed: 7B, as expected
SELECT
*
FROM
foo.bar
WHERE
id = 'my_id';
-- bytes processed: 10.56 MB, why?
WITH
cte AS (
SELECT
'my_id' AS id)
SELECT
*
FROM
foo.bar
WHERE
EXISTS (
SELECT
1
FROM
cte
WHERE
bar.id = cte.id);
What is the reason for this?
Is there anything I can do to make my DELETE queries that depend on other tables process less data, i.e. cheaper?
Thanks!
Even with a clustered column, BigQuery may opt for a full table scan in these scenarios:
Joins and Subqueries: BigQuery's query optimizer sometimes struggles to propagate the filtering benefits of clustered columns through joins and subqueries. This is especially true for correlated subqueries (like your EXISTS clause), which are evaluated against rows of the outer query, and BigQuery may not recognize opportunities to leverage clustering efficiently.
Non-Literal Filters: When you filter directly on a literal value (e.g., id = 'my_id'), BigQuery can use the clustering metadata to prune down to the relevant data blocks. When the filter value comes from another table or a subquery, however, BigQuery may not apply the same block pruning, often resulting in a full table scan. One workaround is sketched right after this list.
Query Complexity: In complex queries, BigQuery might opt for a full table scan if it assesses this as the most efficient overall strategy, even at the expense of not using the clustering.
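For the non-literal filter point, one pattern worth testing is to materialize the ids into a script variable first, so the DELETE no longer references a second table at all. This is only a sketch: BigQuery doesn't document variable filters as equivalent to literal filters for block pruning, so compare the bytes billed of both variants yourself (for clustered tables the upfront estimate is an upper bound, and the billed figure after the job runs can be lower). It also only suits modest id lists, since script variables have a size limit. foo.ids_to_delete is a stand-in name for your real source of ids:
DECLARE ids ARRAY<STRING>;
-- Materialize the ids; this scans only the driving table.
SET ids = (SELECT ARRAY_AGG(id) FROM foo.ids_to_delete);
-- The filter now references a variable instead of another table.
DELETE FROM foo.bar
WHERE id IN UNNEST(ids);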
Optimizing DELETE Queries
Here are some strategies to potentially reduce the data processed in your DELETE scenarios:
Staging Table: Create a staging table holding the id values you intend to delete, and be sure to include CLUSTER BY id when creating it (note that in BigQuery DDL the CLUSTER BY clause goes before AS):
CREATE TABLE foo.temp_delete CLUSTER BY id AS SELECT id FROM cte; -- replace cte with your actual source of ids
Then use the staging table instead of the CTE in your DELETE statement:
DELETE FROM foo.bar WHERE id IN (SELECT id FROM foo.temp_delete);
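If you use this pattern regularly, here is a slightly fuller sketch of the same idea; foo.ids_to_delete is again a stand-in name, and the expiration option just saves you from having to drop the staging table yourself. Do measure bytes billed before and after: the staging table keeps the driving side small and the query plan simple, but BigQuery does not promise block pruning for the IN subquery.
CREATE TABLE foo.temp_delete
CLUSTER BY id
OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 DAY))
AS
SELECT id FROM foo.ids_to_delete;  -- stand-in source of ids

DELETE FROM foo.bar
WHERE id IN (SELECT id FROM foo.temp_delete);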
Batching: Split one very large DELETE into smaller batches, so each job stays small, is cheap to retry on failure, and is easy to monitor; a scripted sketch follows below. Note that batching bounds the size of each job rather than necessarily reducing the total bytes processed.
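A minimal scripting sketch of the batching idea, assuming the staging table foo.temp_delete from above; batch and the 100000 batch size are illustrative choices:
LOOP
  -- Peel off the next batch of ids.
  CREATE OR REPLACE TEMP TABLE batch AS
  SELECT id FROM foo.temp_delete LIMIT 100000;

  -- Stop once the staging table is drained.
  IF (SELECT COUNT(*) FROM batch) = 0 THEN
    LEAVE;
  END IF;

  DELETE FROM foo.bar
  WHERE id IN (SELECT id FROM batch);

  -- Remove the processed ids so the loop makes progress.
  DELETE FROM foo.temp_delete
  WHERE id IN (SELECT id FROM batch);
END LOOP;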
Partitioning (if applicable): If your foo.bar table includes a time-based column, partitioning it could be beneficial. You could then add a constant filter on the partitioning column to the DELETE, reducing the amount of data scanned; see the sketch below.
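A hedged sketch, assuming a hypothetical DATE partitioning column created_at (the example table in the question has no such column, so this applies only if your real table does). The constant filter on the partitioning column lets BigQuery prune partitions before the id subquery is evaluated:
DELETE FROM foo.bar
WHERE created_at BETWEEN '2024-01-01' AND '2024-01-31'  -- constant partition filter
  AND id IN (SELECT id FROM foo.temp_delete);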
Denormalization (with caution): In some cases you can copy the data that drives the delete decision into foo.bar itself, so the DELETE can filter on the table's own id column without consulting another table. This eliminates the need for a join during DELETE, but be cautious, as it can increase storage costs and pose challenges for maintaining data consistency.
Additional Considerations
Cost vs. Performance: While these optimizations can help, DELETE operations in BigQuery can still be costly. Balancing cost reduction with performance is crucial.
BigQuery Updates: Keep an eye on BigQuery's release notes. The platform is continuously evolving, and future updates may enhance the optimizer's ability to more efficiently handle joins and subqueries with clustered columns.
BigQuery Cost Calculator: Use the BigQuery pricing calculator to estimate the potential cost savings of different optimization strategies before implementing them.
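One concrete way to check whether any of these strategies pays off is to compare actual bytes billed before and after in the INFORMATION_SCHEMA jobs view. This sketch assumes your jobs run in the US multi-region; adjust the region qualifier to match your own:
SELECT
  creation_time,
  job_id,
  total_bytes_processed,
  total_bytes_billed  -- for clustered tables, this can be lower than the upfront estimate
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY creation_time DESC;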