I'm trying to find a way to find references to our warehouse views. I see that when a view is referenced in a query, the referenced_tables columns in JOBS view shows the underlying tables but not the views that are actually referenced in the FROM clause of the query.
How can we find the counts of direct references of the views but not the underlying tables. ?
Solved! Go to Solution.
The INFORMATION_SCHEMA.JOBS
view in BigQuery does not directly provide information about view references. However, you can extract this information through the following workaround:
Query the query
Column: The query
column in the INFORMATION_SCHEMA.JOBS
view contains the text of the executed SQL query. You can parse this query text to identify view references.
Identify Views in the Query Text: Look for patterns in the query text like FROM view_name
or JOIN view_name
. You may need to handle complex queries with subqueries or aliases.
Aggregate View References: Group the extracted view references to count how many times each view is used. Filter out references to underlying tables if you only want direct view references.
WITH QueryReferences AS (
SELECT
job_id,
query,
REGEXP_EXTRACT_ALL(
LOWER(query),
r'from\s+`?(\w+\.)?(\w+)\.(\w+)`?'
) AS views
FROM
`region-us`.INFORMATION_SCHEMA.JOBS
WHERE
-- Filter for the relevant time period
creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY) AND CURRENT_TIMESTAMP()
AND state = 'DONE'
AND job_type = 'QUERY'
AND error_result IS NULL
)
SELECT
views[OFFSET(2)] AS referenced_view,
COUNT(*) AS reference_count
FROM
QueryReferences,
UNNEST(views) AS views
WHERE
views[OFFSET(1)] NOT LIKE '%INFORMATION_SCHEMA%'
AND views[OFFSET(1)] NOT IN ('other_excluded_view1', 'other_excluded_view2')
GROUP BY
referenced_view
ORDER BY
reference_count DESC;
Explanation:
Common Table Expression (CTE) QueryReferences
:
Extracts job_id
and query
from the INFORMATION_SCHEMA.JOBS
view.
Uses REGEXP_EXTRACT_ALL
to find all occurrences of dataset_id.view_name
in the FROM
clause of the query.
Filters by job_type = 'QUERY'
, state = 'DONE'
, and error_result IS NULL
to only include completed and successful query jobs.
Uses LOWER
to perform a case-insensitive search.
Main Query:
Unnests the extracted views
array to list each view individually.
Extracts the view name and counts the number of times each view is referenced.
Uses the OFFSET
function to correctly parse the extracted parts of the view names.
Filters out system views or unwanted views in the WHERE
clause.
Groups by referenced_view
and orders the result by reference_count
in descending order to show the most referenced views first.
Important Considerations:
Performance: Parsing large volumes of query text can be computationally expensive. Consider filtering the INFORMATION_SCHEMA.JOBS
view based on project, user, or time range to improve performance.
Complexity: Parsing complex SQL queries accurately can be challenging. You might need to refine your regular expressions to handle different query patterns.
Accuracy: This method relies on analyzing the raw query text. There's a chance of false positives or negatives if your regular expressions are not comprehensive enough.
The INFORMATION_SCHEMA.JOBS
view in BigQuery does not directly provide information about view references. However, you can extract this information through the following workaround:
Query the query
Column: The query
column in the INFORMATION_SCHEMA.JOBS
view contains the text of the executed SQL query. You can parse this query text to identify view references.
Identify Views in the Query Text: Look for patterns in the query text like FROM view_name
or JOIN view_name
. You may need to handle complex queries with subqueries or aliases.
Aggregate View References: Group the extracted view references to count how many times each view is used. Filter out references to underlying tables if you only want direct view references.
WITH QueryReferences AS (
SELECT
job_id,
query,
REGEXP_EXTRACT_ALL(
LOWER(query),
r'from\s+`?(\w+\.)?(\w+)\.(\w+)`?'
) AS views
FROM
`region-us`.INFORMATION_SCHEMA.JOBS
WHERE
-- Filter for the relevant time period
creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY) AND CURRENT_TIMESTAMP()
AND state = 'DONE'
AND job_type = 'QUERY'
AND error_result IS NULL
)
SELECT
views[OFFSET(2)] AS referenced_view,
COUNT(*) AS reference_count
FROM
QueryReferences,
UNNEST(views) AS views
WHERE
views[OFFSET(1)] NOT LIKE '%INFORMATION_SCHEMA%'
AND views[OFFSET(1)] NOT IN ('other_excluded_view1', 'other_excluded_view2')
GROUP BY
referenced_view
ORDER BY
reference_count DESC;
Explanation:
Common Table Expression (CTE) QueryReferences
:
Extracts job_id
and query
from the INFORMATION_SCHEMA.JOBS
view.
Uses REGEXP_EXTRACT_ALL
to find all occurrences of dataset_id.view_name
in the FROM
clause of the query.
Filters by job_type = 'QUERY'
, state = 'DONE'
, and error_result IS NULL
to only include completed and successful query jobs.
Uses LOWER
to perform a case-insensitive search.
Main Query:
Unnests the extracted views
array to list each view individually.
Extracts the view name and counts the number of times each view is referenced.
Uses the OFFSET
function to correctly parse the extracted parts of the view names.
Filters out system views or unwanted views in the WHERE
clause.
Groups by referenced_view
and orders the result by reference_count
in descending order to show the most referenced views first.
Important Considerations:
Performance: Parsing large volumes of query text can be computationally expensive. Consider filtering the INFORMATION_SCHEMA.JOBS
view based on project, user, or time range to improve performance.
Complexity: Parsing complex SQL queries accurately can be challenging. You might need to refine your regular expressions to handle different query patterns.
Accuracy: This method relies on analyzing the raw query text. There's a chance of false positives or negatives if your regular expressions are not comprehensive enough.