I'm new to BigQuery and have just been slapped with a bill for $1,400 for 9 days of queries on my Workspace logs. I have a script pulling back my Workspace Gmail logs every minute with the following:
SELECT gmail, CURRENT_TIMESTAMP(), TIMESTAMP_MICROS(gmail.event_info.timestamp_usec)
FROM `workspace-logs-XXXXX.wsl.activity`
WHERE TIMESTAMP(TIMESTAMP_MICROS(gmail.event_info.timestamp_usec)) > '2023-12-16 11:11:52.776000'
ORDER BY gmail.event_info.timestamp_usec ASC;
It's costing me a fortune, so I'm looking for ways to optimise this query. Sure, I can query less regularly, but is there anything else I can do?
More info for each query:
Duration: 1 sec
Bytes processed: 22.12 GB
Bytes billed: 22.12 GB
Slot milliseconds: 599,902
Looking at your query, you seem to be retrieving all records after some date (2023-12-16), and you are telling me that you are doing this once a minute ... every minute. This feels like you are returning the same records over and over again rather than just the new / changed records. In the on-demand billing model, Google charges $6.25 for every terabyte scanned. Looking at your query and the data you are showing, we have:
22.12 GB / 1,024 GB per TB * 60 calls/hour * 24 hours * $6.25/TB ≈ $194.41/day => ~$1,749 for 9 days
The ways to reduce your costs are:
1. Make fewer calls
2. Reduce the data scanned per call
What is the schema/structure of the table you are querying? I'd be looking to see whether the table is partitioned. Partitioning means that when you perform a query you only need to consider a subset of the data. I'd also be interested in hearing how the table is being updated. The thought that keeps coming to me is that the data being returned seems to be the same data over and over.
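For example, a quick way to check is to ask INFORMATION_SCHEMA which column (if any) the table is partitioned on. A sketch, reusing the dataset and table names from your query (note that an ingestion-time partitioned table will not list a column here; it exposes the _PARTITIONTIME pseudo-column instead):

-- Show the partitioning column of the activity table, if one is defined
SELECT column_name, is_partitioning_column
FROM `workspace-logs-XXXXX.wsl.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'activity'
  AND is_partitioning_column = 'YES';

If the table is day-partitioned, a filter on that column (or on _PARTITIONTIME) lets a query prune to one or two partitions instead of scanning the whole table on every call.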
Let's also consider a value proposition ... if you consider $194/day a poor value, what would you consider a good value? How much time and energy are you willing to invest to change the situation (1 year would be ~$70K ... so it feels like a good investment).
I'm very new to BigQuery so bear with me, I'm learning as I type. The table is populated using the Workspace log export feature (https://support.google.com/a/answer/9079365?hl=en), which looks to be importing once per hour. Yeah, I see that my query should never have been running once per minute if the source data is only imported hourly. It's partitioned by day too.
I'm thinking I should query only after the import runs, pulling just the records created in the last hour from today's partition.
The goal is simply to get the logs out of BigQuery for archiving.
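Rough sketch of what I'm thinking, assuming I can filter on the _PARTITIONTIME pseudo-column (that's an assumption about how the day partitioning is exposed):

-- Run once an hour, shortly after the hourly import lands
SELECT gmail, TIMESTAMP_MICROS(gmail.event_info.timestamp_usec) AS event_ts
FROM `workspace-logs-XXXXX.wsl.activity`
WHERE DATE(_PARTITIONTIME) = CURRENT_DATE()  -- scan only today's partition
  AND TIMESTAMP_MICROS(gmail.event_info.timestamp_usec)
        >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)  -- keep only the last hour of events
ORDER BY gmail.event_info.timestamp_usec;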
I think if I'm reading this article correctly ... I am seeing the following:
BigQuery Exports collect Google Workspace data from the previous day's events. The result shows data from the previous day to the export date.
This seems to say to me that the audit log data is written to BigQuery once per day and doesn't then change for a 24-hour period. I'd be tempted to suggest that you run a scheduled job, once per day, that does the equivalent of:
CREATE TABLE <archive_root>_YYYY_MM_DD AS (
  SELECT * FROM <AuditTable> WHERE <AuditRecordDate> >= YYYY_MM_DD
)
You can use Cloud Scheduler, BigQuery Scheduled Queries, or Cloud Composer (or other) to schedule and run the job.
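A minimal sketch of that job as a BigQuery script, assuming ingestion-time partitioning (the my_archive dataset name and the _PARTITIONTIME filter are placeholders/assumptions; swap in your real archive dataset and partition column):

-- Build a dated table name like my_archive.activity_2023_12_16 for yesterday
DECLARE archive_table STRING DEFAULT CONCAT(
  'my_archive.activity_',
  FORMAT_DATE('%Y_%m_%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)));

-- Copy yesterday's partition into the dated archive table
EXECUTE IMMEDIATE FORMAT("""
  CREATE TABLE `%s` AS
  SELECT *
  FROM `workspace-logs-XXXXX.wsl.activity`
  WHERE DATE(_PARTITIONTIME) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
""", archive_table);

Run as a once-a-day BigQuery scheduled query, this scans a single day's partition rather than the whole table.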
I assume you are using the on-demand pricing model, given that high cost. That gets expensive quickly since it bills by bytes processed.
For query optimization, you can partition the table on your timestamp column. But when you query, it is better not to apply those timestamp functions to the column, so that the partition can be used. Instead, convert the value you are comparing against to the data type/format of your column. The partition will help reduce the data scanned, but if the column is wrapped in a function the query may not make use of the partition.
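As a sketch of that rewrite against the query from the original post (how much it reduces bytes billed depends on whether the compared column is actually the partitioning column, which is an assumption here):

-- Before: functions are wrapped around the column
--   WHERE TIMESTAMP(TIMESTAMP_MICROS(gmail.event_info.timestamp_usec)) > '2023-12-16 11:11:52.776000'
-- After: the column is compared bare and the literal is converted to its type (microseconds)
SELECT gmail, CURRENT_TIMESTAMP(), TIMESTAMP_MICROS(gmail.event_info.timestamp_usec)
FROM `workspace-logs-XXXXX.wsl.activity`
WHERE gmail.event_info.timestamp_usec > UNIX_MICROS(TIMESTAMP '2023-12-16 11:11:52.776000')
ORDER BY gmail.event_info.timestamp_usec ASC;

If the table is partitioned by ingestion time instead, keep the bare comparison and also add a DATE(_PARTITIONTIME) filter, since a filter on the partition column is what actually lets BigQuery prune the data it scans.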
Thanks for providing such a detailed response @kolban - I hope this helps out @Jamsambo - we definitely do not want you to be in a situation where you (and your org) are paying for compute power that is not serving your needs efficiently. Hopefully the suggestions will help you realize some savings soon!