What guarantees exist for detailed data exports?

Hi,

When exporting cloud billing data to BigQuery, with detailed data exports enabled, are past billing records ever removed or modified in anyway by future data exports?

I had assumed no, but ran into a situation where an export time (from a billing record I had previously downloaded from BigQuery) no longer has any billing records associated with it in my billing data table. This is how I searched for it:

 

SELECT *
FROM <Billing data table>
WHERE export_time = TIMESTAMP_MICROS(<unix timestmap>);

 

If they are modified, is the export time for that billing record updated to the current export time? Is there any way to search for modified records? If records are deleted/replaced, is there a way to map the old records we've downloaded to the new records?


Solved Solved
0 3 532
1 ACCEPTED SOLUTION

I got help from another support channel and thought I'd share my findings here. There's some undocumented behavior happening here:

  • All BigQuery billing tables are partitioned by day, on the column _PARTITIONTIME
  • That table is written to multiple times a day with new records for the current day.
  • All writes to this table write in a Write_truncate affinity
    • This means when new billing records are written for a partition day, all current billing records associated with that partition day (_PARTITIONTIME) are deleted and rewritten
    • All billing records written or rewritten in this write get a new export time, which is guaranteed to be latest export time in the table, at the time of this write


What was happening in my case: I was pulling in billing records from the current day, multiple times a day. The export times that I found in my local records, that I could no longer find in the BigQuery billing records table, were from records that were overwritten with new versions of the same records, with a new export time

View solution in original post

3 REPLIES 3

I can't say for sure but I would be surprised if Google updated previously exported billing table data.  My thinking is that customers can use this data for charge backs and other items and should trust data contained within as being immutable.  Since these tables are *just* tables, we have another thing we can look at.  Changes to BigQuery tables are always written as audit log records to Cloud Logging.  I just looked at my own exports in Cloud Logging and could quickly find the records in logging that were generated because GCP itself had inserted new records in the billing tables.   If something/someone had changes the tables, we should see that in an audit trail for the table.  However, occam's simplest is that "no" the tables are not updated.  So that takes us back to your original puzzle.  You show a select from a billing table for a given export time.  That's pretty specific ... what is making you think that you should get any records back for that.   Maybe try a bounded search where export_time is > lower value and < upper_value as opposed to a VERY specific exact timestamp.

Hi Kolban,

re: why I'm looking for a specific export time:

My organization is working on monitoring our cloud costs, and as part of that effort we are exporting our billing information from our BigQuery table to a gcs bucket. We pull in new data based on the last known latest export time. It looks something like this:

query = f"""
EXPORT DATA OPTIONS(
  uri='gs://{gcs_bucket}/{current_drop_path}/data-*.parquet',
  format='PARQUET',
  overwrite=true) AS
SELECT *
FROM {billing_data_table}
WHERE invoice.month = '{invoice_month}' AND 'export_time > {latest_export_unix_time};
"""

While kicking the tires on this service, we noticed some discrepancies. In particular, we noticed that after a reingest, export times that existed in previously exported billing records did not exist in the data we reingested. That's what prompted the spot check I shared in my original post. The fact that we can't find billing records associated with the export times of previously ingested billing records is very alarming, as our system currently relies on the notion that data currently in the BigQuery billing data table is immutable.

I'll poke around Cloud Logging to see if I find anything interesting. Do you have any tips as to how I should search those logs?

I got help from another support channel and thought I'd share my findings here. There's some undocumented behavior happening here:

  • All BigQuery billing tables are partitioned by day, on the column _PARTITIONTIME
  • That table is written to multiple times a day with new records for the current day.
  • All writes to this table write in a Write_truncate affinity
    • This means when new billing records are written for a partition day, all current billing records associated with that partition day (_PARTITIONTIME) are deleted and rewritten
    • All billing records written or rewritten in this write get a new export time, which is guaranteed to be latest export time in the table, at the time of this write


What was happening in my case: I was pulling in billing records from the current day, multiple times a day. The export times that I found in my local records, that I could no longer find in the BigQuery billing records table, were from records that were overwritten with new versions of the same records, with a new export time