Hi, nowadays I'm using this flow to keep data versioning.
The application publishes a message to PubSub, and DataFlow streams the data from PubSub into an "intermediate" BigQuery table.
Then I have a stored procedure, called by a cron job, that picks up the new data, calculates the history and writes it to a "final" BigQuery table.
So what does "calculate the history" mean? An example: the application publishes something like this:
{
"createdAt": "2023-01-11T17:15:17Z",
"name": "john",
"id": "123",
"gender": null
}
DataFlow streams this to the "intermediate" BigQuery table as-is. The stored procedure reads it and stores something like this in the "final" BigQuery table:
createdAt | name | id | gender | rowEffectiveFrom | rowEffectiveTo |
2023-01-11T17:15:17Z | john | 123 | null | 2023-01-11 | 9999-12-01 |
Then the application publishes another message, something like this:
{
"createdAt": "2023-01-12T09:00:11Z",
"name": "john doe",
"masterId": "123",
"gender": null
}
Again DataFlow streams it to the "intermediate" BigQuery table as-is. The stored procedure picks this data up from the "intermediate" table and stores something like this in the "final" BigQuery table (a sketch of such a procedure follows the table below):
createdAt | name | id | gender | rowEffectiveFrom | rowEffectiveTo |
2023-01-11T17:15:17Z | john | 123 | null | 2023-01-11 | 2023-01-12 |
2023-01-12T09:00:11Z | john doe | 123 | null | 2023-01-12 | 9999-12-01 |
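Roughly, the stored procedure does an SCD Type 2 MERGE like the sketch below (table and column names are placeholders, and the real procedure also has to pick up only the rows that arrived since the last run):

-- Each source row appears twice: once with a merge key, to close the
-- currently-open version, and once with a NULL key, so it falls through
-- to the INSERT branch as the new open version.
MERGE `final.accounts` f
USING (
  SELECT id AS mergeKey, id, createdAt, name, gender
  FROM `intermediate.accounts`
  UNION ALL
  SELECT NULL AS mergeKey, id, createdAt, name, gender
  FROM `intermediate.accounts`
) s
ON f.id = s.mergeKey
   AND f.rowEffectiveTo = DATE '9999-12-01'
WHEN MATCHED AND s.createdAt > f.createdAt THEN
  -- close the previous version of this account
  UPDATE SET rowEffectiveTo = DATE(s.createdAt)
WHEN NOT MATCHED AND s.mergeKey IS NULL THEN
  -- insert the new open version
  INSERT (createdAt, name, id, gender, rowEffectiveFrom, rowEffectiveTo)
  VALUES (s.createdAt, s.name, s.id, s.gender, DATE(s.createdAt), DATE '9999-12-01');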
I'm using partitioning on the "final" BigQuery table, with the rowEffectiveTo field as the partition column, so queries are cheaper and the data analytics team can filter on rowEffectiveTo = "9999-12-01" to get the current value for each account.
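The table definition is roughly like this (schema simplified, names are placeholders):

CREATE TABLE `final.accounts` (
  createdAt TIMESTAMP,
  name STRING,
  id STRING,
  gender STRING,
  rowEffectiveFrom DATE,
  rowEffectiveTo DATE
)
-- queries filtering rowEffectiveTo = "9999-12-01" only scan the open partition
PARTITION BY rowEffectiveTo;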
I saw that I can remove the DataFlow component and write directly to BigQuery from PubSub. I tried this: I removed DataFlow, the "intermediate" BigQuery table, the stored procedure and the cron job (and dropped the rowEffectiveFrom and rowEffectiveTo fields), and used this query to get the latest value for each account:
SELECT AS VALUE ARRAY_AGG(t ORDER BY createdAt DESC LIMIT 1)[OFFSET(0)]
FROM `my-poc` t
GROUP BY id
But I saw that this query will become very expensive over time, because it has to scan all rows to find the latest data.
Is there another solution that lets me keep historical data?
References:
https://towardsdatascience.com/slowly-changing-dimension-type-2-with-google-bigquery-749e0f9107fb
https://medium.com/codex/collaborate-better-with-data-versioning-566c2299c435
By accessing the historical data, did you mean time travel? BigQuery has a feature called time travel [1]: by specifying a time travel window, you can access the data as it was at any point in time within that window.
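For example, something like this (using the table name from your query) reads the table as it was one hour ago; note that the time travel window can only be configured between two and seven days:

SELECT *
FROM `my-poc`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);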
@nceniza Yes, it's like time travel, but I need it for a long period, and from the link you shared I saw that I would need to use snapshots for that.
When streaming from PubSub to BigQuery, how can I update a row instead of appending one more row?