Hi, I have a question about the number of bytes processed when joining on a clustered column with another table. I've noticed that BigQuery still does a full table scan, which I wasn't expecting. If I filter against a literal value, it works as expected.
I've noticed this because of the high cost of doing deletes based on a join, e.g.
DELETE FROM foo.bar WHERE EXISTS (SELECT 1 FROM cte WHERE bar.id = cte.id)
does a full table scan, even if `id` is a clustering column.
But the same applies to SELECT queries.
Example:
CREATE SCHEMA IF NOT EXISTS `foo`;

CREATE TABLE IF NOT EXISTS `foo.bar` (id STRING)
CLUSTER BY id;

-- insert approximately 10 MB of data (each UUID is 36 bytes)
INSERT INTO `foo.bar` (id)
SELECT GENERATE_UUID() AS id
FROM UNNEST(GENERATE_ARRAY(1, 1024 * 1024 * 10 / 36));
INSERT INTO `foo.bar` (id)
VALUES ('my_id');

-- bytes processed: 7 B, as expected
SELECT *
FROM foo.bar
WHERE id = 'my_id';
-- bytes processed: 10.56 MB, why?
WITH cte AS (
  SELECT 'my_id' AS id
)
SELECT *
FROM foo.bar
WHERE EXISTS (
  SELECT 1
  FROM cte
  WHERE bar.id = cte.id
);
What is the reason for this?
Is there anything I can do to make my DELETE queries that depend on other tables process less data, i.e. cheaper?
Thanks!
Even with a clustered column, BigQuery may opt for a full table scan in these scenarios:
Joins and Subqueries: BigQuery's query optimizer sometimes struggles to propagate the filtering benefits of clustered columns through joins and subqueries. This is especially true with correlated subqueries (like your EXISTS clause), where the subquery is evaluated against each row of the outer query, and BigQuery may not recognize opportunities to leverage clustering efficiently.
Non-Literal Filters: When you filter directly on a literal value (e.g., id = 'my_id'), BigQuery can use the clustering metadata to prune irrelevant data blocks. However, when the filter values come from another table or the result of a subquery, BigQuery often cannot apply the same block pruning, resulting in a full table scan.
Query Complexity: In complex queries, BigQuery might opt for a full table scan if it assesses this as the most efficient overall strategy, even at the expense of not exploiting clustering.
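One quick thing worth testing before restructuring anything: materialize the keys into a scripting variable, so the filter becomes an array of values rather than a correlated subquery. This is only a sketch (foo.source_ids is a hypothetical stand-in for wherever your keys come from), and block pruning is not guaranteed to kick in here, so verify against the bytes processed:
-- hypothetical: pull the keys into a script variable first
DECLARE ids ARRAY<STRING>;
SET ids = (SELECT ARRAY_AGG(id) FROM foo.source_ids);

-- the filter is now an array of values instead of a correlated subquery;
-- check bytes processed to confirm whether pruning actually applies
SELECT *
FROM foo.bar
WHERE id IN UNNEST(ids);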
Optimizing DELETE Queries
Here are some strategies to potentially reduce the data processed in your DELETE scenarios:
Staging Table: Create a staging table holding only the id values you intend to delete, and include CLUSTER BY id when creating it (in CTAS syntax, CLUSTER BY comes before AS):
CREATE TABLE foo.temp_delete CLUSTER BY id AS SELECT id FROM cte;
Then reference the staging table in your DELETE statement:
DELETE FROM foo.bar WHERE id IN (SELECT id FROM foo.temp_delete);
Batching: Break a very large delete into smaller batches, for example by ranges of the clustered column, so each statement targets a narrower slice of the table (a combined sketch appears at the end of this reply).
Partitioning (if applicable): If your foo.bar table includes a time-based column, partitioning it could be beneficial. You could then filter on the partition before performing the DELETE, reducing the amount of data scanned.
Denormalization (with caution): Store the information that drives the deletion decision alongside the id column in foo.bar. This eliminates the need for a join during DELETE, but be cautious, as it can increase storage costs and pose challenges for maintaining data consistency.
Additional Considerations
Cost vs. Performance: While these optimizations can help, DELETE operations in BigQuery can still be costly. Balancing cost reduction with performance is crucial.
BigQuery Updates: Keep an eye on BigQuery's release notes. The platform is continuously evolving, and future updates may enhance the optimizer's ability to more efficiently handle joins and subqueries with clustered columns.
BigQuery Cost Calculator: Use the BigQuery pricing calculator to estimate the potential cost savings of different optimization strategies before implementing them.
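Putting the staging-table and batching ideas together, here is a minimal end-to-end sketch. The key source foo.source_ids, the range bounds, and the batch granularity are all hypothetical, and note that batching bounds the work of each statement but does not necessarily lower the total bytes billed, so measure before adopting it:
-- DECLAREs must come first in a BigQuery script
DECLARE bounds ARRAY<STRING> DEFAULT ['0', '4', '8', 'c', 'g'];  -- 'g' sorts above any hex UUID
DECLARE i INT64 DEFAULT 0;

-- stage the keys to delete in a small clustered table
CREATE TABLE foo.temp_delete
CLUSTER BY id
AS SELECT id FROM foo.source_ids;

-- delete in lexicographic slices of the clustered column
WHILE i + 1 < ARRAY_LENGTH(bounds) DO
  DELETE FROM foo.bar
  WHERE id >= bounds[OFFSET(i)]
    AND id < bounds[OFFSET(i + 1)]
    AND id IN (SELECT id FROM foo.temp_delete);
  SET i = i + 1;
END WHILE;

-- clean up
DROP TABLE foo.temp_delete;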
Am I correct in saying, then, that the only benefit of clustering from a data-scan point of view is using it in a WHERE clause when querying that table alone?
I have two tables, both partitioned on date and clustered on a column, say "abc". If I join the two tables on the partitioned and clustered columns, why should I not expect the total data scanned to go down?
Clustering in BigQuery primarily benefits data retrieval when filtering on the clustered column in a WHERE clause, especially when querying a single table. Clustering organizes data by sorting rows based on the clustered column, allowing BigQuery to skip irrelevant data blocks and reduce the amount of data scanned. This optimization is particularly effective with literal filters or range queries on the clustered column.
However, when joining two tables that are both partitioned and clustered on the same column, the reduction in data scanned might not be as significant as expected. Although clustering should theoretically optimize these joins by limiting the data scanned to specific blocks, the actual performance depends on how BigQuery’s query optimizer handles the join. The optimizer might not fully leverage clustering in complex queries or when non-literal filters are involved, sometimes opting for a full table scan if it determines that approach to be more efficient.
In single-table queries, clustering works well in reducing data scans, but in join scenarios, the benefits can vary. While partitioning helps by restricting the query to specific partitions, clustering within those partitions might not always be fully exploited, especially if the join condition or WHERE clause is complex.
Clustering is most effective for single-table queries with direct filters on clustered columns, but its impact on join queries can be limited by the complexity of the query and the behavior of BigQuery’s query optimizer. To achieve optimal performance, it’s important to align join conditions with clustered columns and experiment with different query structures.
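As a concrete pattern to experiment with, you can push literal partition and cluster filters into both sides of the join before joining, so each table can be pruned on its own rather than relying on the optimizer to propagate filters through the join. A sketch with hypothetical table and column names (t1 and t2 partitioned on dt and clustered on abc):
SELECT a.abc, a.payload, b.other
FROM (
  -- literal partition + cluster filters let BigQuery prune t1 on its own
  SELECT abc, payload
  FROM mydataset.t1
  WHERE dt = DATE '2024-01-01'
    AND abc BETWEEN 'a' AND 'm'
) AS a
JOIN (
  -- repeat the same literal filters on t2; the join itself does not prune
  SELECT abc, other
  FROM mydataset.t2
  WHERE dt = DATE '2024-01-01'
    AND abc BETWEEN 'a' AND 'm'
) AS b
USING (abc);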
Help me understand: how should I reduce the total data scanned in an effective way?
Partitioning is limited to a single column, and clustering can only be done on up to 4 columns.
So BQ essentially wants its customers to create huge datasets with hundreds of columns and only cluster 4 of them? On top of that, the columnar storage setup makes the data scanned directly proportional to the number of columns pulled.
If I denormalize multiple tables into massive flat-file-style tables, I only have 4 columns to cluster for reducing data scans. If I normalize tables and break them into multiple partitioned and clustered tables, I am still unable to reduce the data scanned.
I get charged for large data scans, and there doesn't seem to be a good feature in BQ that allows me to handle this.
BigQuery stores data in columns and allows you to organize it with partitioning and clustering, which are key tools for optimizing query performance. Partitioning breaks your data into smaller chunks based on one column, often a date, while clustering further organizes data within each partition using up to four columns. These techniques help BigQuery locate the data you need more efficiently, but they come with limitations. You can only partition on one column and cluster on four, which can be challenging when dealing with large, complex tables, especially if your queries frequently involve joins.
To minimize data scans and control costs in BigQuery, consider the following strategies:
Partition Wisely: Choose partition columns that you frequently filter on in your queries. This allows BigQuery to skip over irrelevant partitions, reducing the amount of data scanned.
Cluster Strategically: Cluster tables based on columns that are commonly used in WHERE, GROUP BY, or JOIN clauses. This helps BigQuery quickly access the relevant data within each partition.
Be Selective in Queries: Instead of using SELECT *, specify only the columns you need. This reduces the amount of data that BigQuery has to read, lowering your costs.
Consider Normalization: Break down large tables into smaller, more focused ones. While this can make queries more complex, it can also reduce the volume of data scanned by isolating only the necessary information.
Use Materialized Views or Staging Tables: Materialized views store pre-computed query results, speeding up subsequent queries. Staging tables can be used to decompose complex queries into smaller, more manageable steps, optimizing performance and reducing data scans.
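To make the first two points and the materialized-view idea concrete, here is a sketch with hypothetical table and column names:
-- partition on the column you filter by most; cluster on common filter/join columns
CREATE TABLE IF NOT EXISTS mydataset.events (
  event_date  DATE,
  customer_id STRING,
  country     STRING,
  payload     STRING
)
PARTITION BY event_date
CLUSTER BY customer_id, country;

-- select only the needed columns and filter on the partition and cluster
-- columns, so BigQuery can prune both partitions and blocks
SELECT customer_id, country
FROM mydataset.events
WHERE event_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31'
  AND customer_id = 'c-123';

-- a materialized view pre-aggregates a common rollup, so repeated queries
-- read the (much smaller) view instead of the base table
CREATE MATERIALIZED VIEW mydataset.daily_counts AS
SELECT event_date, country, COUNT(*) AS n
FROM mydataset.events
GROUP BY event_date, country;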
Partitioning is only possible on integers or time. This is a huge limitation.
If I normalize tables and cluster them, then I have to join them. If I join them, I have very limited flexibility, because partitioning is severely restricted.
If I join normalized tables on partitioned and clustered columns, the data scanned remains unaffected.
You have to factor in that partitioning doesn't work on string fields. Please provide a solution that considers these restrictions.
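For context, the closest workaround I've come across for string keys is deriving an integer partitioning key by hashing the string and using integer-range partitioning. A sketch with hypothetical names; whether the extra bucket column is worth the complexity is exactly what I'm unsure about:
-- derive an integer bucket from the string key so the table can be
-- range-partitioned despite the integer/time restriction
CREATE TABLE mydataset.bar_hashed (
  id        STRING,
  id_bucket INT64
)
PARTITION BY RANGE_BUCKET(id_bucket, GENERATE_ARRAY(0, 100, 1))
CLUSTER BY id;

-- populate the bucket from the string key at load time
INSERT INTO mydataset.bar_hashed (id, id_bucket)
SELECT id, ABS(MOD(FARM_FINGERPRINT(id), 100))
FROM mydataset.bar;

-- queries must repeat the same hash expression on a literal to get
-- partition pruning, plus the id filter for block pruning within it
SELECT *
FROM mydataset.bar_hashed
WHERE id_bucket = ABS(MOD(FARM_FINGERPRINT('my_id'), 100))
  AND id = 'my_id';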