Hi Team,
1. Can we add clustering to any of the existing tables? If yes, will the clustering be applied to existing data in the table?
2. We have Parquet files (~70 TB). Which would be an efficient way to store the files with respect to pricing, irrespective of use case: GCS or BigQuery?
3. What are the best optimization approaches in BigQuery to save cost?
Any leads on this are much appreciated.
Thanks,
AJ
Here are the responses to your BigQuery questions:
1. Adding Clustering to Existing Tables:
Yes, you can add clustering to an existing table by using the `ALTER TABLE` command along with the `CLUSTER BY` clause to specify the clustering columns for your table.
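For illustration, a minimal sketch of one way to get the existing data into a clustered layout; the dataset, table, and column names are placeholders. It uses `CREATE OR REPLACE TABLE` to rewrite the table with a clustering specification rather than the `ALTER TABLE` statement described above, since (as discussed later in this thread) changing the clustering on an existing table does not reorganize rows that are already there:

```sql
-- Hypothetical names: mydataset.pricing with a customer_id column.
-- Rewrites the table with a clustering specification, so the existing
-- rows are physically reorganized by the clustering column.
-- Note: this re-reads the full table (billed as a query) and replaces
-- the table definition, so test it on a copy first.
CREATE OR REPLACE TABLE mydataset.pricing
CLUSTER BY customer_id
AS
SELECT *
FROM mydataset.pricing;
```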
2. Storing Large Parquet Files: GCS vs. BigQuery:
Here is a breakdown of the two primary options:
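As a rough sketch of what the two options look like in SQL (the bucket path, dataset, and table names below are placeholders): keep the Parquet files in GCS and query them in place through an external table, or load them into BigQuery native storage.

```sql
-- Option A: keep the Parquet files in GCS and query them in place
-- through an external table (storage stays billed as Cloud Storage).
CREATE EXTERNAL TABLE mydataset.pricing_ext
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/pricing/*.parquet']  -- placeholder path
);

-- Option B: load the files into BigQuery native storage
-- (billed at BigQuery storage rates, typically faster to query).
LOAD DATA INTO mydataset.pricing_native
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://my-bucket/pricing/*.parquet']  -- placeholder path
);
```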
3. BigQuery Cost Optimization Techniques:
Here are some of the most effective strategies to control BigQuery costs:
- Partition your tables and filter on the partition columns in your `WHERE` clauses. This helps BigQuery scan only the relevant sections of your data (block pruning); see the sketch after this list.
- Use `SELECT` to specify the exact columns you need instead of using `SELECT *`.
- Use the `--dry_run` flag (or the equivalent API setting) to estimate query costs before executing them.
- Evaluate if your queries would benefit from clustering, partitioning, or optimization.
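To make the first two points concrete, a small sketch with placeholder table and column names; the bytes a query would scan can be checked up front with the `--dry_run` flag mentioned above:

```sql
-- Partitioned and clustered layout so queries can prune blocks.
CREATE TABLE mydataset.pricing_daily (
  price_date DATE,
  sku        STRING,
  region     STRING,
  price      NUMERIC
)
PARTITION BY price_date
CLUSTER BY sku;

-- Filter on the partition column and read only the needed columns
-- instead of SELECT *, so only the relevant partitions/blocks are scanned.
SELECT sku, price
FROM mydataset.pricing_daily
WHERE price_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31'
  AND sku = 'SKU-12345';
```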
@ms4446 - I thought historical data in a non-clustered table would not be clustered retroactively, as that is one of the limitations in the official documentation:
https://cloud.google.com/bigquery/docs/clustered-tables#limitations
@siddharth_bhand, apologies for the confusion. You are correct: historical data in the table will not be retroactively clustered. Adding clustering to an existing non-clustered table only applies the clustering organization to data inserted or modified after the clustering is implemented.
Here's the breakdown of strategies to cluster the existing data: you can create a new clustered table and repopulate it with an `INSERT INTO [new_clustered_table] SELECT * FROM [old_non_clustered_table]` type of query.
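A sketch of that approach, with placeholder names and an illustrative schema:

```sql
-- 1. Create an empty table with the desired clustering specification.
CREATE TABLE mydataset.new_clustered_table (
  order_id    STRING,
  customer_id STRING,
  order_ts    TIMESTAMP,
  amount      NUMERIC
)
CLUSTER BY customer_id;

-- 2. Copy everything over; rows written now are organized
--    according to the clustering columns.
INSERT INTO mydataset.new_clustered_table
SELECT * FROM mydataset.old_non_clustered_table;
```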
Thanks for the clarification, @ms4446, as I was on the same page as @siddharth_bhand with the clustering limitations!
My use case is a bit different, but based on what has been presented in this thread, in theory one could run the `ALTER TABLE` command to add clustering fields, followed by `TRUNCATE TABLE` to remove all the data, before `INSERT INTO` is used to repopulate the table. The history is included in my batch process, so I want to ensure the newly loaded data is freshly clustered since it will not be appended. Does this track?
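For what it's worth, a sketch of the reload portion of that sequence, with placeholder names and assuming the clustering fields have already been set on the table (which is the step in question):

```sql
-- Empty the table; the table definition, including any clustering
-- specification, stays in place.
TRUNCATE TABLE mydataset.pricing;

-- Reload the full history; because every row is newly inserted,
-- it is written according to the table's clustering columns.
INSERT INTO mydataset.pricing
SELECT * FROM mydataset.pricing_staging;  -- placeholder source
```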