Total logical bytes remain the same when I replace my data in BigQuery (loading from Datastore to GCS to BigQuery), but total physical bytes ranges anywhere from 4 MB to 480 MB. Why is that happening, and how do I calculate storage and query cost in this case? These are the commands I am using.
# Remove any previous export for this namespace/kind
gsutil rm -r gs://datastore_exports_to_bigquery/$1_$2
# Export the Datastore kind to Cloud Storage
gcloud datastore export --kinds=$2 --namespaces=$1 gs://datastore_exports_to_bigquery/$1_$2
# Load the export into BigQuery, replacing the existing table
bq load --source_format=DATASTORE_BACKUP \
  --replace \
  analytics.coverage \
  gs://datastore_exports_to_bigquery/$1_$2/namespace_$1/kind_$2/namespace_$1_kind_$2.export_metadata
# Clean up the export files
gsutil rm -r gs://datastore_exports_to_bigquery/$1_$2
My understanding is that what you are seeing is the concept of BigQuery time travel. When you modify a table, the previous version of the table is also maintained so that you can return to it. For example, if you accidentally ran a BQ load with bad data, you could undo that load and return to an older version. To do that, Google has to maintain previous incarnations of the data. I believe that the default (as of 2023-02) is 7 days' worth of changes. The physical bytes are thus the amount of storage needed to host your table plus its previous versions. As the time-travel window rolls on, data beyond that period is deleted.
From a cost and query-charge perspective, you are charged for logical storage size, not physical storage size. When you run a query, unless you explicitly include FOR SYSTEM_TIME AS OF ... in it, the scan only looks at the current version of the table and ignores the data retained for time-travel access.
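For example, a time-travel query run through the bq CLI would look something like the following (the project name and the one-hour offset are just placeholders; without the FOR SYSTEM_TIME AS OF clause, only the current version of the table is scanned):
# Query the table as it looked one hour ago
bq query --use_legacy_sql=false \
'SELECT COUNT(*)
 FROM `my-project.analytics.coverage`
 FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)'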
Then I would expect the time-travel data only to grow, but in some cases it also drops, for example from 480 MB to 50 MB. Not sure why. I also need to check whether we are using logical bytes billing or physical bytes billing. Thanks for the answer.
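For reference, one way to check the billing model is to look at the dataset metadata with the bq CLI; I believe the storageBillingModel field reports LOGICAL or PHYSICAL, and if the field is absent the dataset should be on the default logical model (the project and dataset names below are just placeholders):
# Show dataset metadata and look for the "storageBillingModel" field
bq show --format=prettyjson my-project:analytics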
After 7 days, the 8th day of recoverable data is deleted, and I would expect the physical data to decrease; it is a rolling window. My belief then is that if you left a table unmodified for 7 days, the only physical storage in place would be what is needed to hold the current table itself. Also, the physical storage is compressed. I don't know which compression algorithms are used, but in theory, if you added more data, the compression might even find a pattern that results in a smaller overall size.
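If it helps, you can also see how much of the physical footprint is active data versus time-travel data from the INFORMATION_SCHEMA.TABLE_STORAGE view (the project name and the region qualifier below are placeholders for wherever your dataset lives):
# Break down logical vs. physical bytes, including the time-travel portion
bq query --use_legacy_sql=false \
'SELECT table_name,
        total_logical_bytes,
        active_physical_bytes,
        time_travel_physical_bytes,
        total_physical_bytes
 FROM `my-project.region-us.INFORMATION_SCHEMA.TABLE_STORAGE`
 WHERE table_schema = "analytics" AND table_name = "coverage"'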