Hello,
I'm currently involved in a project that aims to deprecate the data pipelines generating BigQuery tables that haven't been accessed in the past X months. The primary intent is to monitor the tables for consumption activity such as querying, sinking data to other storage solutions, and retrieving data via APIs, but I've run into a challenge.
I've noticed that INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION and INFORMATION_SCHEMA.JOBS_BY_PROJECT don't seem to account for external dependencies related to the BigQuery tables. Presently, I'm examining the audit_log and exploring viable logic and methods to extract the necessary data.
I'm reaching out to this community in the hope that someone might have faced a similar challenge and found a solution. Specifically, I'm seeking a method to determine the last consumption-based access date on various BigQuery tables, excluding insertions or creations.
Your insights and suggestions would be greatly appreciated.
Thank you for your time and consideration.
Insights and suggestions on determining the last consumption-based access date on BigQuery tables, excluding insertions or creations:
1. Use the audit log
The audit log is a comprehensive record of all activity on your BigQuery project. It can be used to track consumption-based access to your tables by filtering for read-style operations, most notably `jobservice.jobcompleted` events with a `methodName` of `QUERY`. Once you have filtered the audit log for consumption-based access, you can use the `timestamp` column to determine the last time each table was accessed.
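For illustration, here is a minimal sketch of pulling recent query-completion entries directly from Cloud Logging with the google-cloud-logging client. It assumes the legacy BigQuery audit log format (resource type `bigquery_resource`, method name `jobservice.jobcompleted`) and an arbitrary start date; if your project emits the newer BigQueryAuditMetadata entries, the filter will need to be adapted.

from google.cloud import logging as cloud_logging

# Sketch only: the filter assumes the legacy BigQuery audit log format.
client = cloud_logging.Client()

log_filter = (
    'resource.type="bigquery_resource" '
    'AND protoPayload.methodName="jobservice.jobcompleted" '
    'AND timestamp >= "2023-09-01T00:00:00Z"'
)

# Newest entries first; stop after a handful just to inspect their shape.
for i, entry in enumerate(client.list_entries(filter_=log_filter,
                                              order_by=cloud_logging.DESCENDING)):
    print(entry.timestamp, entry.log_name)
    if i >= 9:
        break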
2. Use a custom script
You can also write a custom script to extract the last consumption-based access date for your tables. The following Python code shows how to do this:
import pandas as pd
from google.cloud import bigquery

# Connect to BigQuery
client = bigquery.Client()

# Reference the exported audit log table. This assumes a log sink that
# exports BigQuery audit logs to a dataset named 'audit_logs' with a table
# named 'data_access'; adjust the names (and the column names below) to
# match your own export schema.
dataset_ref = bigquery.DatasetReference(client.project, 'audit_logs')
table_ref = dataset_ref.table('data_access')

# Get query-completion events from the audit log for the past 30 days
query = f"""
SELECT
  timestamp,
  operation,
  resource_type,
  resource_name
FROM
  `{table_ref.project}.{table_ref.dataset_id}.{table_ref.table_id}`
WHERE
  operation = 'jobservice.jobcompleted'
  AND methodName = 'QUERY'
  AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
"""
df = client.query(query).to_dataframe()

# Keep only rows that refer to table resources (consumption-based access)
df_filtered = df[df['resource_type'] == 'TABLE']

# Get the last consumption-based access date for each table
last_consumption_date_per_table = df_filtered.groupby('resource_name')['timestamp'].max()

# Print the results
print(last_consumption_date_per_table)
This code will print output similar to the following (one row per table):
resource_name
my_project.my_dataset.my_table 2023-10-10 11:33:00
Name: timestamp, dtype: object
3. Use a third-party tool
There are also a number of third-party tools that can help you to track consumption-based access to your BigQuery tables. These tools typically provide a more user-friendly interface and additional features, such as the ability to set alerts and generate reports.
Recommendation:
I recommend using the audit log to determine the last consumption-based access date for your tables. This is the most reliable and comprehensive method. You can either write a custom script to extract the data from the audit log, or you can use a third-party tool.
Once you have determined the last consumption-based access date for your tables, you can use this information to deprecate data pipelines generating tables that haven't been accessed in the past X months.
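As a rough sketch of that last step, and assuming the `last_consumption_date_per_table` series produced by the script above, deprecation candidates could be flagged like this (the six-month cutoff is just an illustrative stand-in for X):

import pandas as pd

# Continues from the script above: 'last_consumption_date_per_table' holds the
# most recent access timestamp per table. The 6-month cutoff is a placeholder.
cutoff = pd.Timestamp.now(tz="UTC") - pd.DateOffset(months=6)

# Make sure the timestamps are timezone-aware datetimes before comparing.
last_access = pd.to_datetime(last_consumption_date_per_table, utc=True)

# Tables whose most recent consumption-based access predates the cutoff are
# candidates for deprecating the pipelines that produce them.
deprecation_candidates = last_access[last_access < cutoff]
print(deprecation_candidates.sort_values())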
Hello @ms4446!
Thanks for answering my questions in such great detail. Will the filtering you suggested, `jobservice.jobcompleted` with a `methodName` of `QUERY`, capture all interactions, including external dependencies? For example, if somebody has a sink job to GCS or Bigtable, will those be captured as well?
Thanks again!
You are welcome @ylee1212
The filtering I suggested (`jobservice.jobcompleted` with a `methodName` of `QUERY`) primarily captures query operations on BigQuery tables. It does not inherently capture sink jobs to GCS or Bigtable.
For sink jobs or exports to GCS, you would typically look for `jobservice.jobcompleted` events where the job configuration is of type `extract`. BigQuery audit logs do not have specific method names for writing to GCS or Bigtable.
The audit logs provide a wealth of information about each operation, including the user or service account that initiated it, the timestamp, the destination, and other relevant details. However, they do not capture the specific data that was written to or read from a table.
For a more comprehensive view of how your BigQuery tables are being used, especially if you need granular details or enhanced analytics, you might consider using a third-party tool that can collect and analyze additional data, such as query logs and job metadata.
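As a minimal sketch of what that broader filter might look like, here is a variation on the earlier script that widens the `methodName` filter to cover both query and extract jobs. It assumes the same hypothetical flattened audit-log export (`audit_logs.data_access`) as before; the table path, column names, and method values will differ depending on your export schema.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical flattened audit-log export, as in the earlier script. Adjust
# the table path, column names, and method values to match your own schema.
query = """
SELECT
  resource_name,
  MAX(timestamp) AS last_access
FROM
  `my_project.audit_logs.data_access`
WHERE
  operation = 'jobservice.jobcompleted'
  AND methodName IN ('QUERY', 'EXTRACT')  -- queries plus exports to GCS
  AND resource_type = 'TABLE'
GROUP BY
  resource_name
"""

# One row per table, with its most recent query- or extract-based access.
for row in client.query(query):
    print(row.resource_name, row.last_access)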
Thanks again, @ms4446! 😁
To confirm: the audit_log captures all the access logs. So, while it does not explicitly capture all those sink or export operations as `jobCompleted` events, it will still capture all the various access types to BigQuery tables, am I correct?
Yes, you are correct. The BigQuery audit log captures a comprehensive record of interactions with BigQuery resources, including sink or export operations, even if they aren't explicitly labeled as `jobCompleted`.
For instance, if you set up a sink job that exports data from a BigQuery table to GCS, the audit log will record details such as the job type (`extract`) and other configuration details. While the audit log provides a wealth of information about each operation, it won't capture the specific data being transferred or exported.
For those who need more granular insights or enhanced analytics on sink or export operations, third-party tools can be considered. These tools often offer capabilities like advanced querying, visualization, and alerting, which can complement the data available in the standard audit logs.
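If you want to see those export operations in the raw logs, here is a small sketch using the google-cloud-logging client. It assumes the legacy audit log format, where completed extract jobs surface as `jobservice.jobcompleted` entries with an `extract_job_completed` event name; verify the field paths against a sample entry in Logs Explorer before relying on them.

from google.cloud import logging as cloud_logging

client = cloud_logging.Client()

# Sketch only: field paths assume the legacy BigQuery audit log (AuditData)
# format; confirm them against your own log entries.
log_filter = (
    'resource.type="bigquery_resource" '
    'AND protoPayload.methodName="jobservice.jobcompleted" '
    'AND protoPayload.serviceData.jobCompletedEvent.eventName="extract_job_completed"'
)

# Newest export-job completions first.
for i, entry in enumerate(client.list_entries(filter_=log_filter,
                                              order_by=cloud_logging.DESCENDING)):
    print(entry.timestamp)
    if i >= 9:
        break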