I have a 100 GB table and I want to update a non-partition column. The billed bytes show 100 GB of processing. AFAIK BigQuery's architecture is columnar, so why isn't it optimizing the table scan down to the single column?
This was tested on a sample table, and the execution graph shows no shuffling.
I want to highlight that the update is applied only to a non-partition column; it's understandable that updating a partition column would lead to shuffling.
Sample query:
UPDATE mytable SET columnA = "blabla" WHERE TRUE
You are correct that BigQuery uses a columnar storage format, which is efficient for accessing individual columns. However, this efficiency in querying does not extend to update operations, particularly when it comes to optimizing scanning. Here's a more detailed explanation:
Immutable Storage Model: BigQuery's storage model is immutable, meaning that once data is stored, it cannot be directly modified. This is a fundamental aspect of BigQuery's architecture. Therefore, any update operation, no matter how small, results in the creation of a new version of the table with the changes applied. This necessitates scanning and rewriting the entire table, even for updates to a single column.
Full Table Scan for Updates: Despite the columnar storage, BigQuery performs a full table scan when executing DML operations like updates. This is because BigQuery is optimized for bulk data processing and analytics, rather than for transactional or single-row updates. Therefore, even when updating a single non-partition column, BigQuery scans the entire table to identify and apply the changes.
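To make this concrete, a single-column UPDATE is conceptually a rewrite of the table's storage rather than an in-place edit. A hedged sketch of the equivalent operation, assuming a hypothetical dataset name `mydataset` (`SELECT * REPLACE` is standard BigQuery syntax for substituting one column's value while keeping the rest):

```sql
-- The UPDATE above behaves, in cost terms, roughly like rewriting
-- the whole table with the new column value substituted in:
CREATE OR REPLACE TABLE mydataset.mytable AS
SELECT * REPLACE ("blabla" AS columnA)
FROM mydataset.mytable;
```

Both forms read every column of every row and write the result back, which is why the billed bytes match the full table size.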
Billing Based on Data Processed: In BigQuery, you are billed for the amount of data processed. In the case of an update operation, this often equates to the size of the entire table, since the whole table is read and then rewritten with the update. This explains the high processing bytes observed in your case.
Optimization Strategies: Use a WHERE clause to limit the update operation to specific rows. This can reduce the amount of data processed, although it still involves a full table rewrite.

Again, the reason your update query scanned the entire table, despite targeting a single non-partition column, is BigQuery's immutable storage model and its design for bulk data processing. By applying the suggested optimization strategies, you can potentially improve the efficiency of your BigQuery operations and manage processing costs more effectively.
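One concrete way to benefit from a WHERE clause is partition pruning: if the table is partitioned and the filter references the partition column, BigQuery only reads and rewrites the matching partitions. A minimal sketch, assuming a hypothetical table partitioned on an `event_date` DATE column (the column name and date are assumptions for illustration):

```sql
-- Only the 2024-01-01 partition is scanned and rewritten,
-- so billed bytes reflect that partition, not the whole table.
UPDATE mydataset.mytable
SET columnA = "blabla"
WHERE event_date = DATE "2024-01-01";
```

Note that this pruning only applies when the filter is on the partition (or clustering) column; filtering on an ordinary column still requires scanning the full table to find matching rows.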