Hello,
I'm currently involved in a project that aims to deprecate data pipelines generating BigQuery tables that haven't been accessed in the past X months. While the primary intent is to monitor the tables for consumption activities such as querying, sinking data to other storage solutions, and retrieving data via APIs, I've identified a challenge.
I've noticed that INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION and INFORMATION_SCHEMA.JOBS_BY_PROJECT don't seem to account for external dependencies on the BigQuery tables (for example, consumption that doesn't run as a query job). At the moment I'm examining the audit logs and exploring viable logic and methods for extracting the necessary data.
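For context, this is roughly the kind of query I've been experimenting with against the JOBS views (the region qualifier and names are placeholders); it captures query jobs but misses the external consumption mentioned above:

from google.cloud import bigquery

# Roughly what I've tried so far: last query-based access per table from
# INFORMATION_SCHEMA.JOBS_BY_PROJECT (the region qualifier is a placeholder).
client = bigquery.Client()
query = """
SELECT
  ref.project_id,
  ref.dataset_id,
  ref.table_id,
  MAX(creation_time) AS last_queried
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
  UNNEST(referenced_tables) AS ref
WHERE
  job_type = 'QUERY'
GROUP BY 1, 2, 3
"""
for row in client.query(query):
    print(row.table_id, row.last_queried)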
I'm reaching out to this community in the hope that someone might have faced a similar challenge and found a solution. Specifically, I'm seeking a method to determine the last consumption-based access date on various BigQuery tables, excluding insertions or creations.
Your insights and suggestions would be greatly appreciated.
Thank you for your time and consideration.
Here are some insights and suggestions for determining the last consumption-based access date on your BigQuery tables, excluding insertions and creations:
1. Use the audit log
The audit log is a comprehensive record of all activity in your BigQuery project. It can be used to track consumption-based access to your tables by filtering for operations such as:
- completed query jobs (in the legacy AuditData format, methodName jobservice.jobcompleted with eventName query_job_completed)
- completed export jobs (eventName extract_job_completed)
- table data reads through the API (methodName tabledata.list)
Load jobs and table-creation events should be excluded. Once you have filtered the audit log for consumption-based access, you can use the timestamp of each entry to determine the last time each table was accessed, as in the sketch below.
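If you prefer pulling entries straight from Cloud Logging rather than a BigQuery export, here is a minimal sketch, assuming the legacy AuditData log format (field paths differ for projects that emit BigQueryAuditMetadata):

from google.cloud import logging

# Minimal sketch, assuming legacy AuditData entries; the filter fields and
# the timestamp placeholder below should be adjusted for your project.
client = logging.Client()
log_filter = (
    'resource.type="bigquery_resource" '
    'AND protoPayload.methodName="jobservice.jobcompleted" '
    'AND protoPayload.serviceData.jobCompletedEvent.eventName="query_job_completed" '
    'AND timestamp >= "2023-09-01T00:00:00Z"'
)
for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    # entry.payload holds the audit proto as a dict; referenced tables sit under
    # serviceData.jobCompletedEvent.job.jobStatistics.referencedTables.
    print(entry.timestamp)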
2. Use a custom script
You can also write a custom script to extract the last consumption-based access date for your tables. The following Python sketch shows one way to do this, assuming your audit logs are exported to BigQuery via a log sink (the dataset and table names are placeholders):
from google.cloud import bigquery

# Connect to BigQuery
client = bigquery.Client()

# Audit logs are assumed to be exported via a Cloud Logging sink into this
# dataset; the dataset and table names are placeholders for your own sink.
table = bigquery.DatasetReference(client.project, 'audit_logs').table(
    'cloudaudit_googleapis_com_data_access')

# Completed query jobs from the past 30 days, one row per referenced table
# (legacy AuditData export schema; adjust paths for BigQueryAuditMetadata)
query = f"""
SELECT
  timestamp,
  ref.projectId AS project_id,
  ref.datasetId AS dataset_id,
  ref.tableId AS table_id
FROM
  `{table.project}.{table.dataset_id}.{table.table_id}`,
  UNNEST(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.referencedTables) AS ref
WHERE
  protopayload_auditlog.methodName = 'jobservice.jobcompleted'
  AND protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.eventName = 'query_job_completed'
  AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
"""
df = client.query(query).to_dataframe()

# Build fully qualified table names
df['table_name'] = df['project_id'] + '.' + df['dataset_id'] + '.' + df['table_id']

# Get the last consumption-based access date for each table
last_consumption_date_per_table = df.groupby('table_name')['timestamp'].max()

# Print the results
print(last_consumption_date_per_table)
Running this prints output along these lines:
table_name
my_project.my_dataset.my_table    2023-10-10 11:33:00+00:00
Name: timestamp, dtype: datetime64[ns, UTC]
3. Use a third-party tool
There are also a number of third-party tools that can help you to track consumption-based access to your BigQuery tables. These tools typically provide a more user-friendly interface and additional features, such as the ability to set alerts and generate reports.
Recommendation:
I recommend using the audit log to determine the last consumption-based access date for your tables. This is the most reliable and comprehensive method. You can either write a custom script to extract the data from the audit log, or you can use a third-party tool.
Once you have determined the last consumption-based access date for your tables, you can use this information to deprecate data pipelines generating tables that haven't been accessed in the past X months.
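As a hypothetical follow-up (X = 6 here is a placeholder for your retention window), the series computed by the script above can be turned into a list of deprecation candidates:

import pandas as pd

# Hypothetical follow-up: flag tables whose last consumption-based access is
# older than X months; last_consumption_date_per_table comes from the script above.
X_MONTHS = 6  # placeholder retention window
cutoff = pd.Timestamp.now(tz='UTC') - pd.DateOffset(months=X_MONTHS)
stale_tables = last_consumption_date_per_table[last_consumption_date_per_table < cutoff]
print(stale_tables)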
Additional considerations:
Yes, you are correct. The BigQuery audit log captures a comprehensive record of interactions with BigQuery resources, including sink or export operations, even if they aren't explicitly labeled as jobCompleted.
For instance, if you set up a sink job that exports data from a BigQuery table to GCS, the audit log will record details such as the time the job ran, the destination URI, the job type (extract), and other configuration details. While the audit log provides a wealth of information about each operation, it won't capture the specific data being transferred or exported.
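To surface such export activity in an exported audit-log table, a hypothetical sketch (the table path and field names depend on your sink and log format) could look like:

from google.cloud import bigquery

# Hypothetical sketch: recent export (extract) jobs from the same exported
# audit-log table used above; all names here are placeholders.
client = bigquery.Client()
query = """
SELECT
  timestamp,
  protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.extract.destinationUris AS destination_uris
FROM
  `my_project.audit_logs.cloudaudit_googleapis_com_data_access`
WHERE
  protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.eventName = 'extract_job_completed'
  AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
"""
for row in client.query(query):
    print(row.timestamp, row.destination_uris)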
For those who need more granular insights or enhanced analytics on sink or export operations, third-party tools can be considered. These tools often offer capabilities like advanced querying, visualization, and alerting, which can complement the data available in the standard audit logs.