Hi Team,
1. Can we add clustering to any of the existing tables? If yes, will the clustering be applied to existing data in the table?
2. We have parquet files (~70 TB) which would be an efficient way to store the files concerning pricing irrespective of use case? Either GCS or BigQuery.
3. What are the best optimization approaches in BigQuery to save cost?
Any leads on this are much appreciated.
Thanks,
AJ
Here are the responses to your BigQuery questions:
1. Adding Clustering to Existing Tables:
ALTER TABLE
command along with the CLUSTER BY
clause to specify the clustering columns for your table.2. Storing Large Parquet Files: GCS vs. BigQuery:
Here is a breakdown of the two primary options:
3. BigQuery Cost Optimization Techniques:
Here are some of the most effective strategies to control BigQuery costs:
WHERE
clauses. This helps BigQuery scan only the relevant sections of your data (block pruning).SELECT
to specify the exact columns you need instead of using SELECT *
.--dry_run
flag (or the equivalent API setting) to estimate query costs before executing them. Evaluate if your queries would benefit from clustering,partitioning or optimization.Additional Resources:
@ms4446 - I thought the historical data in a non-clustered table would not be clustered as it is one of the limitations in the official documentation.
https://cloud.google.com/bigquery/docs/clustered-tables#limitations
@siddharth_bhand , Apologies for the confusion. You are correct. Historical data in the table will not be retroactively clustered. When adding clustering to an existing non-clustered table will only apply the clustering organization to data inserted or modified after the clustering is implemented.
Here's the breakdown:
Strategies:
INSERT INTO [new_clustered_table] SELECT * FROM [old_non_clustered_table]
type of query.User | Count |
---|---|
5 | |
4 | |
2 | |
1 | |
1 |