Hello,
I'm currently involved in a project that aims to deprecate the data pipelines generating BigQuery tables that haven't been accessed in the past X months. The primary intent is to monitor the tables for consumption activity such as querying, sinking data to other storage solutions, and retrieving data via APIs, but I've run into a challenge.
I've noticed that INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION and INFORMATION_SCHEMA.JOBS_BY_PROJECT don't seem to account for external dependencies related to the BigQuery tables. Presently, I'm examining the audit_log and exploring viable logic and methods to extract the necessary data.
I'm reaching out to this community in the hope that someone might have faced a similar challenge and found a solution. Specifically, I'm seeking a method to determine the last consumption-based access date on various BigQuery tables, excluding insertions or creations.
Your insights and suggestions would be greatly appreciated.
Thank you for your time and consideration.
Insights and suggestions on determining the last consumption-based access date on BigQuery tables, excluding insertions or creations:
1. Use the audit log
The audit log is a comprehensive record of all activity on your BigQuery project. It can be used to track consumption-based access to your tables by filtering for read-style operations, most notably `jobservice.jobcompleted` events with a `methodName` of `QUERY`. Once you have filtered the audit log for consumption-based access, you can use the `timestamp` column to determine the last time each table was accessed.
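For illustration, here is a minimal sketch of pulling recent query-completion entries directly from Cloud Logging with the google-cloud-logging client. It assumes the legacy BigQuery audit log format (resource type `bigquery_resource`, method name `jobservice.jobcompleted`) and an arbitrary start date; if your project emits the newer BigQueryAuditMetadata entries, the filter will need to be adapted.

from google.cloud import logging as cloud_logging

# Sketch only: the filter assumes the legacy BigQuery audit log format.
client = cloud_logging.Client()

log_filter = (
    'resource.type="bigquery_resource" '
    'AND protoPayload.methodName="jobservice.jobcompleted" '
    'AND timestamp >= "2023-09-01T00:00:00Z"'
)

# Newest entries first; stop after a handful just to inspect their shape.
for i, entry in enumerate(client.list_entries(filter_=log_filter,
                                              order_by=cloud_logging.DESCENDING)):
    print(entry.timestamp, entry.log_name)
    if i >= 9:
        break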
2. Use a custom script
You can also write a custom script to extract the last consumption-based access date for your tables. The following Python code shows how to do this:
import pandas as pd
from google.cloud import bigquery

# Connect to BigQuery
client = bigquery.Client()

# Reference the exported audit log table. This assumes a log sink that
# exports BigQuery audit logs to a dataset named 'audit_logs' with a table
# named 'data_access'; adjust the names (and the column names below) to
# match your own export schema.
dataset_ref = bigquery.DatasetReference(client.project, 'audit_logs')
table_ref = dataset_ref.table('data_access')

# Get query-completion events from the audit log for the past 30 days
query = f"""
SELECT
  timestamp,
  operation,
  resource_type,
  resource_name
FROM
  `{table_ref.project}.{table_ref.dataset_id}.{table_ref.table_id}`
WHERE
  operation = 'jobservice.jobcompleted'
  AND methodName = 'QUERY'
  AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
"""
df = client.query(query).to_dataframe()

# Keep only rows that refer to table resources (consumption-based access)
df_filtered = df[df['resource_type'] == 'TABLE']

# Get the last consumption-based access date for each table
last_consumption_date_per_table = df_filtered.groupby('resource_name')['timestamp'].max()

# Print the results
print(last_consumption_date_per_table)
This code will print output similar to the following (one row per table):
resource_name
my_project.my_dataset.my_table 2023-10-10 11:33:00
Name: timestamp, dtype: object
3. Use a third-party tool
There are also a number of third-party tools that can help you to track consumption-based access to your BigQuery tables. These tools typically provide a more user-friendly interface and additional features, such as the ability to set alerts and generate reports.
Recommendation:
I recommend using the audit log to determine the last consumption-based access date for your tables. This is the most reliable and comprehensive method. You can either write a custom script to extract the data from the audit log, or you can use a third-party tool.
Once you have determined the last consumption-based access date for your tables, you can use this information to deprecate data pipelines generating tables that haven't been accessed in the past X months.
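As a rough sketch of that last step, and assuming the `last_consumption_date_per_table` series produced by the script above, deprecation candidates could be flagged like this (the six-month cutoff is just an illustrative stand-in for X):

import pandas as pd

# Continues from the script above: 'last_consumption_date_per_table' holds the
# most recent access timestamp per table. The 6-month cutoff is a placeholder.
cutoff = pd.Timestamp.now(tz="UTC") - pd.DateOffset(months=6)

# Make sure the timestamps are timezone-aware datetimes before comparing.
last_access = pd.to_datetime(last_consumption_date_per_table, utc=True)

# Tables whose most recent consumption-based access predates the cutoff are
# candidates for deprecating the pipelines that produce them.
deprecation_candidates = last_access[last_access < cutoff]
print(deprecation_candidates.sort_values())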
Hello @ms4446!
Thanks for answering my questions in such great detail. Will the filtering you suggested, `jobservice.jobcompleted` with a `methodName` of `QUERY`, capture all interactions, including external dependencies? For example, if somebody has a sink job to GCS or Bigtable, will those be captured as well?
Thanks again!
You are welcome @ylee1212
The filtering I suggested (`jobservice.jobcompleted` with a `methodName` of `QUERY`) primarily captures query operations on BigQuery tables. It does not inherently capture sink jobs to GCS or Bigtable.
For sink jobs or exports to GCS, you would typically look for `jobservice.jobcompleted` events where the job configuration is of type `extract`. BigQuery audit logs do not have specific method names for writing to GCS or Bigtable.
The audit logs provide a wealth of information about each operation, including the user or service account that initiated it, the timestamp, the destination, and other relevant details. However, they do not capture the specific data that was written to or read from a table.
For a more comprehensive view of how your BigQuery tables are being used, especially if you need granular details or enhanced analytics, you might consider using a third-party tool that can collect and analyze additional data, such as query logs and job metadata.
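As a minimal sketch of what that broader filter might look like, here is a variation on the earlier script that widens the `methodName` filter to cover both query and extract jobs. It assumes the same hypothetical flattened audit-log export (`audit_logs.data_access`) as before; the table path, column names, and method values will differ depending on your export schema.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical flattened audit-log export, as in the earlier script. Adjust
# the table path, column names, and method values to match your own schema.
query = """
SELECT
  resource_name,
  MAX(timestamp) AS last_access
FROM
  `my_project.audit_logs.data_access`
WHERE
  operation = 'jobservice.jobcompleted'
  AND methodName IN ('QUERY', 'EXTRACT')  -- queries plus exports to GCS
  AND resource_type = 'TABLE'
GROUP BY
  resource_name
"""

# One row per table, with its most recent query- or extract-based access.
for row in client.query(query):
    print(row.resource_name, row.last_access)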
Thanks again, @ms4446! 😁
To confirm: the audit_log captures all the access logs. So, while it does not explicitly capture all those sink or export operations as `jobCompleted` events, it will still capture all the various access types to BigQuery tables, am I correct?
Yes, you are correct. The BigQuery audit log captures a comprehensive record of interactions with BigQuery resources, including sink or export operations, even if they aren't explicitly labeled as `jobCompleted`.
For instance, if you set up a sink job that exports data from a BigQuery table to GCS, the audit log will record details such as the job type (`extract`) and other configuration details. While the audit log provides a wealth of information about each operation, it won't capture the specific data being transferred or exported.
For those who need more granular insights or enhanced analytics on sink or export operations, third-party tools can be considered. These tools often offer capabilities like advanced querying, visualization, and alerting, which can complement the data available in the standard audit logs.
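If you want to see those export operations in the raw logs, here is a small sketch using the google-cloud-logging client. It assumes the legacy audit log format, where completed extract jobs surface as `jobservice.jobcompleted` entries with an `extract_job_completed` event name; verify the field paths against a sample entry in Logs Explorer before relying on them.

from google.cloud import logging as cloud_logging

client = cloud_logging.Client()

# Sketch only: field paths assume the legacy BigQuery audit log (AuditData)
# format; confirm them against your own log entries.
log_filter = (
    'resource.type="bigquery_resource" '
    'AND protoPayload.methodName="jobservice.jobcompleted" '
    'AND protoPayload.serviceData.jobCompletedEvent.eventName="extract_job_completed"'
)

# Newest export-job completions first.
for i, entry in enumerate(client.list_entries(filter_=log_filter,
                                              order_by=cloud_logging.DESCENDING)):
    print(entry.timestamp)
    if i >= 9:
        break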