
How to get the number of records written to BigQuery in a specific time period

My requirement is to get the count of records that were created or updated in BigQuery within a specific time frame.

How can I get it?

Please help me get this data.


To accurately determine the number of records written to BigQuery within a specific time period, you should use explicit timestamps in your query. For a fixed time frame, like the one between 2023-11-01 00:00:00 UTC and 2023-11-06 09:34:00 UTC, your SQL query should look as follows:

SELECT COUNT(*) AS record_count
FROM `project.dataset.table`
WHERE insert_timestamp >= '2023-11-01 00:00:00 UTC'
AND insert_timestamp < '2023-11-06 09:34:00 UTC';

Make sure to replace project.dataset.table with the actual identifiers of your project, dataset, and table, and insert_timestamp with the name of the timestamp column in your schema that records the insertion time.

Auditing Changes in BigQuery

For auditing changes within BigQuery, enabling Audit Logs is a key step. These logs provide detailed information about data access and changes and can be analyzed directly within BigQuery or through the Google Cloud Console for comprehensive monitoring.

Time Zone Awareness

It's important to ensure that your queries account for the correct time zone. BigQuery assumes UTC for all timestamps without a specified time zone. If your reporting window is defined in a local time zone, you can construct the window boundaries with an explicit time zone and compare your timestamp column against them. Here is an example that interprets the boundaries in the 'America/Los_Angeles' time zone:

SELECT COUNT(*) AS record_count
FROM `project.dataset.table`
WHERE insert_timestamp >= TIMESTAMP('2023-11-01 00:00:00', 'America/Los_Angeles')
AND insert_timestamp < TIMESTAMP('2023-11-06 09:34:00', 'America/Los_Angeles');

Testing and Best Practices

Always test your queries to confirm they return the correct results. This is especially crucial when dealing with time zones, as an incorrect conversion could lead to inaccurate data retrieval.
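
A quick way to run that check is to evaluate the window boundaries on their own before using them in a filter; the example dates above happen to straddle the November 2023 daylight-saving change, so the two boundaries end up with different UTC offsets:

SELECT
  TIMESTAMP('2023-11-01 00:00:00', 'America/Los_Angeles') AS window_start_utc,
  TIMESTAMP('2023-11-06 09:34:00', 'America/Los_Angeles') AS window_end_utc;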

Consider these additional best practices:

  • Schema Design: Incorporate timestamp columns in your BigQuery schema to track the insertion and update times of records (a short sketch follows this list).

  • Query Performance: Utilize partitioned tables based on timestamp columns for improved query performance and cost optimization, especially when dealing with large datasets.

  • Security and Privacy: Be aware of security and privacy concerns when enabling Audit Logs, as they may contain sensitive information that must be protected.

  • Documentation: Keep documentation of your SQL queries and maintain them to reflect any changes in the BigQuery schema or functionalities.
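
As a rough sketch of the schema design and partitioning points above (the table and column names are only illustrative, not taken from any real project):

-- Hypothetical events table: partitioned on the insertion timestamp so that
-- time-window counts only scan the partitions that overlap the filter.
CREATE TABLE `project.dataset.events`
(
  event_id STRING,
  payload STRING,
  insert_timestamp TIMESTAMP,  -- set once, when the row is first written
  update_timestamp TIMESTAMP   -- refreshed whenever the row is updated
)
PARTITION BY DATE(insert_timestamp);

With this layout, a COUNT(*) filtered on insert_timestamp prunes to the matching partitions, which keeps both scan cost and latency down as tables grow.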

I need to use BigQuery SQL to get this information for all tables and datasets. I have more than 500 datasets.

Is there any other way to do this? 

Audit logs might be a good start. Please advise on any other ways too. That would be really helpful.

To retrieve information about row insertions or modifications across all tables and datasets in BigQuery, you can consider the following approach:

  1. Use BigQuery's Client Libraries: For operational scripts and data management at scale, it's recommended to use BigQuery's client libraries in languages like Python, Node.js, or Java. These libraries provide more flexibility and control over iterating datasets and tables and executing queries programmatically.

  2. Query INFORMATION_SCHEMA Views: Use a script to iterate through your datasets and execute queries against the INFORMATION_SCHEMA.TABLES view, or the INFORMATION_SCHEMA.PARTITIONS view if you use partitioned tables. Collect the creation_time, last_modified_time, and row_count for each table (see the first sketch after this list).

  3. Analyze BigQuery Audit Logs: Set up exports of BigQuery Audit Logs to another BigQuery dataset for easy analysis. Create a query that filters these logs for table data change events (rows inserted or deleted) within your desired timeframe (see the second sketch after this list).

  4. Combine Results for a Comprehensive Overview: Combine the metadata from the INFORMATION_SCHEMA views with the audit logs to understand both the current state and historical changes to your tables.
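
For point 2, here is a rough SQL-only sketch of the iteration, on the assumption that all of your datasets live in the US multi-region of the current project (adjust the region qualifier, and add a project qualifier, to match your setup). It lists the datasets from INFORMATION_SCHEMA.SCHEMATA and builds one combined query over each dataset's INFORMATION_SCHEMA.PARTITIONS view:

DECLARE stmt STRING;

-- Build one UNION ALL query that rolls up per-table row counts and the most
-- recent modification time for every dataset in the current project.
SET stmt = (
  SELECT STRING_AGG(
    FORMAT("""
      SELECT '%s' AS dataset_name, table_name, total_rows, last_modified_time
      FROM (
        SELECT table_name,
               SUM(total_rows) AS total_rows,
               MAX(last_modified_time) AS last_modified_time
        FROM `%s`.INFORMATION_SCHEMA.PARTITIONS
        GROUP BY table_name
      )""", schema_name, schema_name),
    " UNION ALL ")
  FROM `region-us`.INFORMATION_SCHEMA.SCHEMATA
);

-- Run the generated statement; with 500+ datasets you may want to batch it.
EXECUTE IMMEDIATE stmt;

If the dynamic SQL gets unwieldy, the same per-dataset query can instead be driven from a small script using the client libraries mentioned in point 1, which is easier to batch and retry at this scale.

For point 3, assuming your log sink exports BigQuery data-access audit logs into a dataset called audit_logs (both the dataset and the table name below are assumptions to check against your own export), a per-table rollup of inserted and deleted row counts could look roughly like this; the JSON paths follow the BigQueryAuditMetadata format:

SELECT
  protopayload_auditlog.resourceName AS table_resource,
  TIMESTAMP_TRUNC(timestamp, DAY) AS day,
  SUM(CAST(JSON_VALUE(protopayload_auditlog.metadataJson,
      '$.tableDataChange.insertedRowsCount') AS INT64)) AS inserted_rows,
  SUM(CAST(JSON_VALUE(protopayload_auditlog.metadataJson,
      '$.tableDataChange.deletedRowsCount') AS INT64)) AS deleted_rows
FROM `my-project.audit_logs.cloudaudit_googleapis_com_data_access`  -- hypothetical project/dataset names
WHERE timestamp >= TIMESTAMP('2023-11-01 00:00:00 UTC')
  AND timestamp < TIMESTAMP('2023-11-06 09:34:00 UTC')
  AND JSON_QUERY(protopayload_auditlog.metadataJson, '$.tableDataChange') IS NOT NULL
GROUP BY table_resource, day
ORDER BY day, table_resource;

Depending on how data lands in your tables (load jobs, query jobs, streaming writes), the exact events you need may differ, so inspect a few sample log rows before relying on the counts.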