In our BigQuery environment, most of our reporting tables are OBT (One Big Table) tables, but they are not partitioned or clustered well. Most were created by analysts who didn't consider forecasted usage when choosing a partitioning and clustering scheme.
Question:
I want to identify the most-used columns for each table in a dataset so I can recommend partitioning and clustering columns per table. I assume I need to parse query text for WHERE clauses, GROUP BY clauses, and JOIN conditions, and count how often each column appears.
Is there an out-of-the-box way to get this information for each table in a dataset?
In BigQuery, effectively partitioning and clustering your tables can significantly improve query performance and cost efficiency. Here are a few methods to identify frequently used columns for this purpose:
1. BigQuery Partitioning and Clustering Recommender (Recommended)
How it Works: This built-in tool leverages machine learning to analyze your query patterns over the last 30 days. It identifies tables that would benefit from partitioning or clustering and suggests optimal columns.
Pros: No setup required; recommendations are driven by your actual workload, and each one typically comes with an estimated impact.
Cons: Needs sufficient query history against a table (roughly the last 30 days) before it can recommend anything, and it only reflects past usage, not planned workload changes.
Where to Find It: In the BigQuery console, go to the "Recommendations" tab under your project (part of Active Assist).
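If you prefer SQL over the console, supported regions also surface recommender output through an INFORMATION_SCHEMA view. A minimal sketch, assuming the RECOMMENDATIONS view and the recommender ID below are available in your region; verify both against the current docs before relying on this:
-- List partitioning/clustering recommendations for the project.
-- The recommender ID is assumed from the Active Assist docs; availability
-- of this view varies by region and release.
SELECT *
FROM `region-us`.INFORMATION_SCHEMA.RECOMMENDATIONS
WHERE recommender = 'google.bigquery.table.PartitionClusterRecommender';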
2. Analyzing Query Logs (DIY Approach)
How it Works: Query the INFORMATION_SCHEMA.JOBS_BY_PROJECT view to access details about executed queries, including the SQL text, then parse that text to extract columns used in WHERE, GROUP BY, and JOIN clauses.
Pros: Full control over the analysis window and logic; works on any table, and you only pay to scan the job metadata.
Cons: Regex-based SQL parsing is brittle (aliases, subqueries, and views all confuse it), job history is retained for only 180 days, and you still have to aggregate and interpret the results yourself.
Example Query:
SELECT
  statement_type,
  -- (?i) makes the regexes case-insensitive; each pattern captures only the
  -- first identifier after its keyword, so treat the results as approximate.
  REGEXP_EXTRACT_ALL(query, r'(?i)WHERE\s+([\w.]+)') AS where_columns,
  REGEXP_EXTRACT_ALL(query, r'(?i)GROUP\s+BY\s+([\w.]+)') AS group_by_columns,
  REGEXP_EXTRACT_ALL(query, r'(?i)JOIN\s+([\w.]+)') AS join_tables,
  REGEXP_EXTRACT_ALL(query, r'(?i)\bON\s+([\w.]+)\s*=') AS join_key_columns
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
  AND error_result IS NULL;  -- skip failed jobs
This query pulls the first column referenced after each WHERE, GROUP BY, JOIN, and ON keyword from queries run in the last 7 days. The regexes are deliberately crude, so aggregate the output before drawing conclusions, as sketched below.
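To turn the raw extracts into a per-table ranking, flatten the arrays and count occurrences. A minimal sketch assuming a hypothetical reporting dataset and sales_obt table (swap in your own names); it uses the referenced_tables field of JOBS_BY_PROJECT to attribute each query to the table it actually read:
-- Rank columns by how often they appear right after WHERE in queries that
-- referenced one specific table. The dataset/table names are placeholders.
WITH usage AS (
  SELECT
    REGEXP_EXTRACT_ALL(query, r'(?i)WHERE\s+([\w.]+)') AS where_columns
  FROM
    `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
    UNNEST(referenced_tables) AS ref
  WHERE
    creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    AND job_type = 'QUERY'
    AND state = 'DONE'
    AND ref.dataset_id = 'reporting'   -- placeholder dataset
    AND ref.table_id = 'sales_obt'     -- placeholder table
)
SELECT
  col AS column_name,
  COUNT(*) AS times_filtered
FROM usage, UNNEST(where_columns) AS col
GROUP BY col
ORDER BY times_filtered DESC;
Run the same aggregation over the GROUP BY and join-key extracts: columns that rank high as filters are partitioning and clustering candidates, while columns that rank high in joins and GROUP BY are clustering candidates.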
3. Third-Party Tools
Several third-party tools can streamline this process. A proper SQL parser such as the open-source sqlglot library (which understands the BigQuery dialect) or Google's ZetaSQL extracts column references far more reliably than regexes, and several data-observability platforms offer query-log analysis out of the box.
Key Considerations for Effective Optimization
Partition on the date or timestamp column that queries filter on most often; a table can have only one partitioning column and at most 10,000 partitions. Cluster on up to four columns, ordered from most to least frequently filtered; clustering helps most with equality and range filters on high-cardinality columns. Re-evaluate periodically, because usage patterns drift.
Additional Tips for OBT (One Big Table) Tables
Wide denormalized tables tend to benefit the most: partition on the event or load date, cluster on the ID columns analysts filter by, and consider setting require_partition_filter so ad-hoc queries can't full-scan the table, as in the rebuild sketch below.
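Once you have settled on columns, rebuilding an OBT table is a single DDL statement. A minimal sketch with hypothetical names (sales_obt, order_ts, customer_id, region); adjust to your own schema:
-- Rebuild the table partitioned by day and clustered on the hottest filter
-- columns; require_partition_filter blocks accidental full scans.
CREATE TABLE `my_project.reporting.sales_obt_v2`
PARTITION BY DATE(order_ts)            -- most frequently filtered date column
CLUSTER BY customer_id, region         -- up to four columns, most-filtered first
OPTIONS (require_partition_filter = TRUE)
AS
SELECT * FROM `my_project.reporting.sales_obt`;
Swap the new table in once you have validated row counts, then drop or archive the original.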
By combining these techniques and insights, you'll be well-equipped to optimize your BigQuery tables, leading to faster queries, reduced costs, and a more efficient data warehouse.