Hi Everyone!
I am new to using the BigQuery INFORMATION_SCHEMA views.
Given a specific job_id that was executed, I would like to know how to extract the columns/fields used from the tables.
I am aware that you can use the string field "query", but this kind of string is quite hard to process.
I am open to any suggestions!
Thanks
Hi @bastian_moraga,
Here are some approaches you can take to extract the column information you need from a job ID, along with some important considerations:
Simplified Extraction:
This approach uses regular expressions within SQL to quickly identify potential column names. Here’s a SQL snippet for clarity:
-- Heuristic only: this assumes a simple, single SELECT list and will miss
-- SELECT *, subqueries, and function arguments that contain commas
WITH job_info AS (
  SELECT query
  FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
  WHERE job_id = 'your_job_id'
)
SELECT ARRAY_AGG(DISTINCT TRIM(column_name)) AS extracted_columns
FROM job_info,
  UNNEST(SPLIT(REGEXP_EXTRACT(query, r'(?i)SELECT\s+(.*?)\s+FROM'), ',')) AS column_name;
Advanced Parsing:
For complex queries, utilizing a SQL parser like sqlparse in Python can provide a more robust solution:
import sqlparse

# Example query text (in practice, fetch it from INFORMATION_SCHEMA or the client library)
query = "SELECT t1.column1 AS col1, SUM(t2.column2) FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id"
parsed = sqlparse.parse(query)
extracted_columns = []
for token in parsed[0].tokens:
    # Stop at FROM so table identifiers are not collected as columns
    if token.ttype is sqlparse.tokens.Keyword and token.value.upper() == 'FROM':
        break
    # A multi-item select list is parsed as an IdentifierList
    if isinstance(token, sqlparse.sql.IdentifierList):
        for identifier in token.get_identifiers():
            # Prefer the underlying name over the alias (col1 -> column1)
            extracted_columns.append(identifier.get_real_name() or identifier.get_name())
    # A single-item select list is a bare Identifier or Function
    elif isinstance(token, (sqlparse.sql.Identifier, sqlparse.sql.Function)):
        extracted_columns.append(token.get_real_name() or token.get_name())
print(extracted_columns)  # ['column1', 'SUM'] -- functions surface their own name, a known limitation
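The snippet above starts from a hard-coded string; to feed it the real job text, one option is to fetch the job directly with the google-cloud-bigquery client library. This is only a sketch: it assumes application-default credentials, that 'your_job_id' names a query job, and that "US" is the location where the job ran.
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

# Look up the finished job; the location must match where the job ran
job = client.get_job("your_job_id", location="US")

# For a query job, job.query holds the original SQL text for the parser
print(job.query)

# Table-level (not column-level) lineage is already available from the job
# statistics, with no string parsing needed at all
for table in job.referenced_tables or []:
    print(table.project, table.dataset_id, table.table_id)
Note that referenced_tables only tells you which tables were touched, not which columns, so it complements rather than replaces the parsing approaches.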
BigQuery Audit Logs:
If you need a high degree of accuracy and insight into actual column usage, exploring BigQuery's audit logs can be beneficial. Here’s how you can query them:
SELECT
  protoPayload.resourceName AS table_name,
  auth.permission AS permission,
  -- Heuristic: this regex only captures words followed by AS or =, and
  -- REGEXP_EXTRACT returns just the first match per row
  ARRAY_AGG(DISTINCT REGEXP_EXTRACT(
    protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.query.query,
    r"(\w+)\s*(?:AS|=)") IGNORE NULLS) AS accessed_columns
FROM `your-project-id.your-dataset-id.cloudaudit_googleapis_com_data_access_*`,
  -- authorizationInfo is a repeated field and must be unnested before use
  UNNEST(protoPayload.authorizationInfo) AS auth
WHERE protoPayload.resourceName LIKE 'projects/%/datasets/%/tables/%'
  AND protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.query.query IS NOT NULL
  -- To scope to the job you started from, you can also filter on
  -- protoPayload.serviceData.jobCompletedEvent.job.jobName.jobId = 'your_job_id'
GROUP BY table_name, permission;
How effective is sqlparse at handling queries with joins, CTEs, or subqueries?
Thank you so much for your answer!
The sqlparse library can be quite helpful in breaking down and analyzing SQL queries, but it does have limitations, especially when dealing with complex SQL structures such as joins, Common Table Expressions (CTEs), and subqueries. Here's how it typically performs with these constructs (the sketch after this list illustrates the behavior):
Joins: sqlparse can generally identify and split the different components of a query, including the JOIN clauses. It will parse the join conditions and can help you separate and identify the tables being joined and the conditions used. However, it does not directly provide a hierarchical relationship or the logical flow of the joins unless you manually interpret the parsed tokens.
CTEs (Common Table Expressions): sqlparse is capable of identifying CTEs within a query. It parses them as distinct tokens, allowing you to extract and review the CTE definitions separately from the main query. As with joins, though, understanding the interaction between the CTEs and the main query requires some manual parsing logic.
Subqueries: This is where sqlparse may start to struggle. While it can identify subqueries by parsing them into nested structures, effectively understanding their role within the main query and extracting useful metadata programmatically can be challenging. The library does not inherently resolve or simplify the analysis of how subqueries interact with the other parts of the query.
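To make this concrete, here is a minimal sketch (the table and column names are invented for illustration) that walks the token tree sqlparse produces for a query combining a CTE, a join, and a subquery:
import sqlparse
from sqlparse.sql import Parenthesis

# Hypothetical query with a CTE, a JOIN, and a subquery
query = """
WITH recent AS (SELECT id, amount FROM orders WHERE day > '2024-01-01')
SELECT r.id, c.id
FROM recent r
JOIN (SELECT id FROM customers WHERE active) c ON r.id = c.id
"""

stmt = sqlparse.parse(query)[0]

# sqlparse yields a token tree, not a semantic model: we see structural
# pieces (keywords, identifiers, parenthesized groups) whose roles we
# still have to interpret ourselves
for token in stmt.tokens:
    if not token.is_whitespace:
        print(type(token).__name__, repr(str(token)[:40]))

# The CTE body and the join subquery appear as Parenthesis groups nested
# somewhere in the tree; finding them takes a recursive search
def find_parens(group):
    for tok in group.tokens:
        if isinstance(tok, Parenthesis):
            yield tok
        if tok.is_group:
            yield from find_parens(tok)

for par in find_parens(stmt):
    print("nested group:", repr(str(par)[:40]))
Running this surfaces the pieces clearly enough, but deciding which parenthesized group is the CTE body and which is the subquery is left entirely to your own traversal logic, which is exactly the limitation described above.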
Effectiveness and Considerations:
sqlparse does well at breaking a query down into tokens, which is useful for syntax highlighting or basic parsing needs, but the tokens require manual interpretation to be fully useful in a complex analysis.
sqlparse can provide the structural breakdown but not a semantic understanding, so you will need additional logic to handle these complexities.
sqlparse lays the groundwork by parsing the SQL into chunks; you will need to write custom scripts to interpret those chunks according to your specific needs.
If you frequently need to analyze complex SQL queries for things like column extraction or dependency analysis, consider a more sophisticated SQL parsing tool or service that offers deeper insight into query structure. For one-off or simpler tasks, sqlparse combined with custom Python scripting can be quite effective.
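As one concrete example of such a tool (not the only option): the sqlglot library builds a full AST and can pull out column references across CTEs, joins, and aggregates in a couple of lines. A minimal sketch, with an invented query:
import sqlglot
from sqlglot import exp

query = """
WITH recent AS (SELECT id, amount FROM orders)
SELECT r.id, SUM(r.amount) AS total
FROM recent r
JOIN customers c ON r.id = c.id
GROUP BY r.id
"""

# parse_one builds an AST; find_all walks it for every column reference,
# including those inside the CTE, the join condition, and the aggregate
tree = sqlglot.parse_one(query, read="bigquery")
columns = sorted({col.sql() for col in tree.find_all(exp.Column)})
print(columns)  # ['amount', 'c.id', 'id', 'r.amount', 'r.id']
Because it understands the query structure rather than just tokenizing it, this kind of parser handles the join/CTE/subquery cases that require manual work with sqlparse.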