Hi Team,
1. Can we add clustering to any of the existing tables? If yes, will the clustering be applied to existing data in the table?
2. We have Parquet files (~70 TB). Which would be an efficient way to store the files with respect to pricing, irrespective of use case: GCS or BigQuery?
3. What are the best optimization approaches in BigQuery to save cost?
Any leads on this are much appreciated.
Thanks,
AJ
Here are the responses to your BigQuery questions:
1. Adding Clustering to Existing Tables:
Yes, you can add clustering to an existing table by using the `ALTER TABLE` command along with the `CLUSTER BY` clause to specify the clustering columns for your table.
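For illustration, a minimal sketch of one way to get the existing data into a clustered layout; the dataset, table, and column names are placeholders. It uses `CREATE OR REPLACE TABLE` to rewrite the table with a clustering specification rather than the `ALTER TABLE` statement described above, since (as discussed later in this thread) changing the clustering on an existing table does not reorganize rows that are already there:

```sql
-- Hypothetical names: mydataset.pricing with a customer_id column.
-- Rewrites the table with a clustering specification, so the existing
-- rows are physically reorganized by the clustering column.
-- Note: this re-reads the full table (billed as a query) and replaces
-- the table definition, so test it on a copy first.
CREATE OR REPLACE TABLE mydataset.pricing
CLUSTER BY customer_id
AS
SELECT *
FROM mydataset.pricing;
```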
2. Storing Large Parquet Files: GCS vs. BigQuery:
Here is a breakdown of the two primary options:
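As a rough sketch of what the two options look like in SQL (the bucket path, dataset, and table names below are placeholders): keep the Parquet files in GCS and query them in place through an external table, or load them into BigQuery native storage.

```sql
-- Option A: keep the Parquet files in GCS and query them in place
-- through an external table (storage stays billed as Cloud Storage).
CREATE EXTERNAL TABLE mydataset.pricing_ext
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/pricing/*.parquet']  -- placeholder path
);

-- Option B: load the files into BigQuery native storage
-- (billed at BigQuery storage rates, typically faster to query).
LOAD DATA INTO mydataset.pricing_native
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://my-bucket/pricing/*.parquet']  -- placeholder path
);
```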
3. BigQuery Cost Optimization Techniques:
Here are some of the most effective strategies to control BigQuery costs:
- Partition your tables and filter on the partition columns in your `WHERE` clauses. This helps BigQuery scan only the relevant sections of your data (block pruning); see the sketch after this list.
- Use `SELECT` to specify the exact columns you need instead of using `SELECT *`.
- Use the `--dry_run` flag (or the equivalent API setting) to estimate query costs before executing them.
- Evaluate if your queries would benefit from clustering, partitioning, or optimization.
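To make the first two points concrete, a small sketch with placeholder table and column names; the bytes a query would scan can be checked up front with the `--dry_run` flag mentioned above:

```sql
-- Partitioned and clustered layout so queries can prune blocks.
CREATE TABLE mydataset.pricing_daily (
  price_date DATE,
  sku        STRING,
  region     STRING,
  price      NUMERIC
)
PARTITION BY price_date
CLUSTER BY sku;

-- Filter on the partition column and read only the needed columns
-- instead of SELECT *, so only the relevant partitions/blocks are scanned.
SELECT sku, price
FROM mydataset.pricing_daily
WHERE price_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31'
  AND sku = 'SKU-12345';
```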
@ms4446 - I thought historical data in a non-clustered table would not be clustered retroactively, as that is one of the limitations in the official documentation:
https://cloud.google.com/bigquery/docs/clustered-tables#limitations
@siddharth_bhand, apologies for the confusion. You are correct: historical data in the table will not be retroactively clustered. Adding clustering to an existing non-clustered table only applies the clustering organization to data inserted or modified after the clustering is implemented.
Here's the breakdown of strategies to cluster the existing data: you can create a new clustered table and repopulate it with an `INSERT INTO [new_clustered_table] SELECT * FROM [old_non_clustered_table]` type of query.
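A sketch of that approach, with placeholder names and an illustrative schema:

```sql
-- 1. Create an empty table with the desired clustering specification.
CREATE TABLE mydataset.new_clustered_table (
  order_id    STRING,
  customer_id STRING,
  order_ts    TIMESTAMP,
  amount      NUMERIC
)
CLUSTER BY customer_id;

-- 2. Copy everything over; rows written now are organized
--    according to the clustering columns.
INSERT INTO mydataset.new_clustered_table
SELECT * FROM mydataset.old_non_clustered_table;
```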
Thanks for the clarification, @ms4446, as I was on the same page as @siddharth_bhand with the clustering limitations!
My use case is a bit different, but based on what has been presented in this thread, in theory one could run the `ALTER TABLE` command to add clustering fields, followed by `TRUNCATE TABLE` to remove all the data, before `INSERT INTO` is used to repopulate the table. The history is included in my batch process, so I want to ensure the newly loaded data is freshly clustered since it will not be appended. Does this track?
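For what it's worth, a sketch of the reload portion of that sequence, with placeholder names and assuming the clustering fields have already been set on the table (which is the step in question):

```sql
-- Empty the table; the table definition, including any clustering
-- specification, stays in place.
TRUNCATE TABLE mydataset.pricing;

-- Reload the full history; because every row is newly inserted,
-- it is written according to the table's clustering columns.
INSERT INTO mydataset.pricing
SELECT * FROM mydataset.pricing_staging;  -- placeholder source
```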