Get hands-on experience with 20+ free Google Cloud products and $300 in free credit for new customers.

Why a full table scan when joining a clustered column to another table?

Hi, I have a question about the amount of bytes processed when joining a clustered column with another table. I've noticed that it still does a full table scan, which I wasn't expecting. If I filter against a literal value, it seems to work as expected.

I've noticed this because of the high cost of doing deletes based on a join, e.g.

 

DELETE foo.bar WHERE EXISTS (SELECT 1 FROM CTE WHERE bar.id = baz.id)

 

does a full table scan, even if `id` is a clustered column.

But the same applies to SELECT queries.

Example:

 

CREATE SCHEMA IF NOT EXISTS `foo`;
CREATE TABLE IF NOT EXISTS
  `foo.bar` ( id STRING )
CLUSTER BY
  id;
INSERT INTO
  `foo.bar` (id)
SELECT
  GENERATE_UUID() AS id
FROM
  UNNEST(GENERATE_ARRAY(1, 1024 * 1024 *10/ 36)); -- insert approximately 10MB of data
INSERT INTO
  `foo.bar` (id)
VALUES
  ('my_id');

-- bytes processed: 7B, as expected
SELECT
  *
FROM
  foo.bar
WHERE
  id = 'my_id';

-- bytes processed: 10.56 MB, why?
WITH
  cte AS (
  SELECT
    'my_id' AS id)
SELECT
  *
FROM
  foo.bar
WHERE
  EXISTS (
  SELECT
    1
  FROM
    CTE
  WHERE
    bar.id = cte.id);

 

What is the reason for this?

Is there anything I can do to make my DELETE queries that depend on other tables process less data, i.e. cheaper?

Thanks!

Solved Solved
5 6 2,910
1 ACCEPTED SOLUTION

Even with a clustered column, BigQuery may opt for a full table scan in these scenarios:

  1. Joins and Subqueries: BigQuery's query optimizer sometimes struggles to propagate the filtering benefits of clustered columns through joins and subqueries. This is especially true with correlated subqueries (like your EXISTS clause), where the subquery is evaluated for each row in the outer query, and BigQuery may not recognize opportunities to leverage clustering efficiently.

  2. Non-Literal Filters: When you filter directly on a literal value (e.g., id = 'my_id'), BigQuery can efficiently use the clustered index to find the relevant data blocks. However, when the filter involves values from another table or the result of a subquery, BigQuery may not apply the same optimization, often resulting in a full table scan.

  3. Query Complexity: In complex queries, BigQuery might opt for a full table scan if it assesses this as the most efficient overall strategy, even at the expense of not utilizing the clustered index.

Optimizing DELETE Queries

Here are some strategies to potentially reduce the data processed in your DELETE scenarios:

  1. Staging Table:

    • Before your DELETE operation, create a temporary staging table containing only the id values you intend to delete. Be sure to include CLUSTER BY id when creating the table:
     
    CREATE TABLE foo.temp_delete AS SELECT id FROM CTE CLUSTER BY id; 
    
    • Use a JOIN between foo.bar and the staging table in your DELETE statement:
     
    DELETE FROM foo.bar WHERE id IN (SELECT id FROM foo.temp_delete); 
    
  2. Batching:

    • If your DELETE operation involves a large number of rows, consider breaking it down into smaller batches. This can reduce the amount of data processed in each query and potentially improve performance.
  3. Partitioning (if applicable):

    • If your foo.bar table includes a time-based column, partitioning it could be beneficial. You might then filter on the partition before performing the DELETE, reducing the amount of data scanned.
  4. Denormalization (with caution):

    • In cases where the relationship between your tables is relatively stable, consider denormalizing by duplicating the relevant id column in foo.bar. This eliminates the need for a join during DELETE but be cautious, as it can increase storage costs and pose challenges for maintaining data consistency.

Additional Considerations

  • Cost vs. Performance: While these optimizations can help, DELETE operations in BigQuery can still be costly. Balancing cost reduction with performance is crucial.

  • BigQuery Updates: Keep an eye on BigQuery's release notes. The platform is continuously evolving, and future updates may enhance the optimizer's ability to more efficiently handle joins and subqueries with clustered columns.

  • BigQuery Cost Calculator: Use the BigQuery pricing calculator to estimate the potential cost savings of different optimization strategies before implementing them.

View solution in original post

6 REPLIES 6