Solved: INFORMATION_SCHEMA.JOBS view does not show referen...

crazyrevol · 06-16-2024 10:02 PM

I'm trying to find a way to find references to our warehouse views. I see that when a view is referenced in a query, the referenced_tables columns in JOBS view shows the underlying tables but not the views that are actually referenced in the FROM clause of the query.

How can we find the counts of direct references of the views but not the underlying tables. ?

ms4446

The INFORMATION_SCHEMA.JOBS view in BigQuery does not directly provide information about view references. However, you can extract this information through the following workaround:

Query the query Column: The query column in the INFORMATION_SCHEMA.JOBS view contains the text of the executed SQL query. You can parse this query text to identify view references.
Identify Views in the Query Text: Look for patterns in the query text like FROM view_name or JOIN view_name. You may need to handle complex queries with subqueries or aliases.
Aggregate View References: Group the extracted view references to count how many times each view is used. Filter out references to underlying tables if you only want direct view references.

WITH QueryReferences AS (
  SELECT
    job_id,
    query,
    REGEXP_EXTRACT_ALL(
      LOWER(query),
      r'from\s+`?(\w+\.)?(\w+)\.(\w+)`?'
    ) AS views
  FROM
    `region-us`.INFORMATION_SCHEMA.JOBS
  WHERE
    -- Filter for the relevant time period
    creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY) AND CURRENT_TIMESTAMP()
    AND state = 'DONE'
    AND job_type = 'QUERY'
    AND error_result IS NULL
)

SELECT
  views[OFFSET(2)] AS referenced_view,
  COUNT(*) AS reference_count
FROM
  QueryReferences,
  UNNEST(views) AS views
WHERE
  views[OFFSET(1)] NOT LIKE '%INFORMATION_SCHEMA%'
  AND views[OFFSET(1)] NOT IN ('other_excluded_view1', 'other_excluded_view2')
GROUP BY
  referenced_view
ORDER BY
  reference_count DESC;

Explanation:

Common Table Expression (CTE) QueryReferences:
- Extracts job_id and query from the INFORMATION_SCHEMA.JOBS view.
- Uses REGEXP_EXTRACT_ALL to find all occurrences of dataset_id.view_name in the FROM clause of the query.
- Filters by job_type = 'QUERY', state = 'DONE', and error_result IS NULL to only include completed and successful query jobs.
- Uses LOWER to perform a case-insensitive search.
Main Query:
- Unnests the extracted views array to list each view individually.
- Extracts the view name and counts the number of times each view is referenced.
- Uses the OFFSET function to correctly parse the extracted parts of the view names.
- Filters out system views or unwanted views in the WHERE clause.
- Groups by referenced_view and orders the result by reference_count in descending order to show the most referenced views first.

Important Considerations:

Performance: Parsing large volumes of query text can be computationally expensive. Consider filtering the INFORMATION_SCHEMA.JOBS view based on project, user, or time range to improve performance.
Complexity: Parsing complex SQL queries accurately can be challenging. You might need to refine your regular expressions to handle different query patterns.
Accuracy: This method relies on analyzing the raw query text. There's a chance of false positives or negatives if your regular expressions are not comprehensive enough.

View solution in original post

ms4446

The INFORMATION_SCHEMA.JOBS view in BigQuery does not directly provide information about view references. However, you can extract this information through the following workaround:

Query the query Column: The query column in the INFORMATION_SCHEMA.JOBS view contains the text of the executed SQL query. You can parse this query text to identify view references.
Identify Views in the Query Text: Look for patterns in the query text like FROM view_name or JOIN view_name. You may need to handle complex queries with subqueries or aliases.
Aggregate View References: Group the extracted view references to count how many times each view is used. Filter out references to underlying tables if you only want direct view references.

WITH QueryReferences AS (
  SELECT
    job_id,
    query,
    REGEXP_EXTRACT_ALL(
      LOWER(query),
      r'from\s+`?(\w+\.)?(\w+)\.(\w+)`?'
    ) AS views
  FROM
    `region-us`.INFORMATION_SCHEMA.JOBS
  WHERE
    -- Filter for the relevant time period
    creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY) AND CURRENT_TIMESTAMP()
    AND state = 'DONE'
    AND job_type = 'QUERY'
    AND error_result IS NULL
)

SELECT
  views[OFFSET(2)] AS referenced_view,
  COUNT(*) AS reference_count
FROM
  QueryReferences,
  UNNEST(views) AS views
WHERE
  views[OFFSET(1)] NOT LIKE '%INFORMATION_SCHEMA%'
  AND views[OFFSET(1)] NOT IN ('other_excluded_view1', 'other_excluded_view2')
GROUP BY
  referenced_view
ORDER BY
  reference_count DESC;

Explanation:

Common Table Expression (CTE) QueryReferences:
- Extracts job_id and query from the INFORMATION_SCHEMA.JOBS view.
- Uses REGEXP_EXTRACT_ALL to find all occurrences of dataset_id.view_name in the FROM clause of the query.
- Filters by job_type = 'QUERY', state = 'DONE', and error_result IS NULL to only include completed and successful query jobs.
- Uses LOWER to perform a case-insensitive search.
Main Query:
- Unnests the extracted views array to list each view individually.
- Extracts the view name and counts the number of times each view is referenced.
- Uses the OFFSET function to correctly parse the extracted parts of the view names.
- Filters out system views or unwanted views in the WHERE clause.
- Groups by referenced_view and orders the result by reference_count in descending order to show the most referenced views first.

Important Considerations:

Performance: Parsing large volumes of query text can be computationally expensive. Consider filtering the INFORMATION_SCHEMA.JOBS view based on project, user, or time range to improve performance.
Complexity: Parsing complex SQL queries accurately can be challenging. You might need to refine your regular expressions to handle different query patterns.
Accuracy: This method relies on analyzing the raw query text. There's a chance of false positives or negatives if your regular expressions are not comprehensive enough.

INFORMATION_SCHEMA.JOBS view does not show referenced_tables for views