Hi, nowadays I'm using this flow to keep data versioning.
The application publishes a message to PubSub, and DataFlow streams the data from PubSub into an "intermediate" BigQuery table.
Then I have a stored procedure, called by a cron job, that picks up the new data, calculates the history and writes it to a "final" BigQuery table.
So what does "calculate the history" mean? An example: the application publishes something like this:
{
"createdAt": "2023-01-11T17:15:17Z",
"name": "john",
"id": "123",
"gender": null
}
DataFlow streams this to the "intermediate" BigQuery table as-is. The stored procedure reads it and stores something like this in the "final" BigQuery table:
createdAt | name | id | gender | rowEffectiveFrom | rowEffectiveTo |
2023-01-11T17:15:17Z | john | 123 | null | 2023-01-11 | 9999-12-01 |
Then the application publishes another message, something like this:
{
"createdAt": "2023-01-12T09:00:11Z",
"name": "john doe",
"masterId": "123",
"gender": null
}
Again DataFlow streams it to the "intermediate" BigQuery table as-is. The stored procedure picks this data up from the "intermediate" table and stores something like this in the "final" BigQuery table (a sketch of such a procedure follows the table below):
createdAt | name | id | gender | rowEffectiveFrom | rowEffectiveTo |
2023-01-11T17:15:17Z | john | 123 | null | 2023-01-11 | 2023-01-12 |
2023-01-12T09:00:11Z | john doe | 123 | null | 2023-01-12 | 9999-12-01 |
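Roughly, the stored procedure does an SCD Type 2 MERGE like the sketch below (table and column names are placeholders, and the real procedure also has to pick up only the rows that arrived since the last run):

-- Each source row appears twice: once with a merge key, to close the
-- currently-open version, and once with a NULL key, so it falls through
-- to the INSERT branch as the new open version.
MERGE `final.accounts` f
USING (
  SELECT id AS mergeKey, id, createdAt, name, gender
  FROM `intermediate.accounts`
  UNION ALL
  SELECT NULL AS mergeKey, id, createdAt, name, gender
  FROM `intermediate.accounts`
) s
ON f.id = s.mergeKey
   AND f.rowEffectiveTo = DATE '9999-12-01'
WHEN MATCHED AND s.createdAt > f.createdAt THEN
  -- close the previous version of this account
  UPDATE SET rowEffectiveTo = DATE(s.createdAt)
WHEN NOT MATCHED AND s.mergeKey IS NULL THEN
  -- insert the new open version
  INSERT (createdAt, name, id, gender, rowEffectiveFrom, rowEffectiveTo)
  VALUES (s.createdAt, s.name, s.id, s.gender, DATE(s.createdAt), DATE '9999-12-01');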
I'm using partitioning on the "final" BigQuery table, with the rowEffectiveTo field as the partition column, so queries are cheaper and the data analytics team can filter on rowEffectiveTo = "9999-12-01" to get the current value for each account.
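The table definition is roughly like this (schema simplified, names are placeholders):

CREATE TABLE `final.accounts` (
  createdAt TIMESTAMP,
  name STRING,
  id STRING,
  gender STRING,
  rowEffectiveFrom DATE,
  rowEffectiveTo DATE
)
-- queries filtering rowEffectiveTo = "9999-12-01" only scan the open partition
PARTITION BY rowEffectiveTo;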
I saw that I can remove the DataFlow component and write directly to BigQuery from PubSub. I tried this: I removed DataFlow, the "intermediate" BigQuery table, the stored procedure and the cron job (and dropped the rowEffectiveFrom and rowEffectiveTo fields), and used this query to get the latest value for each account:
SELECT AS VALUE ARRAY_AGG(t ORDER BY createdAt DESC LIMIT 1)[OFFSET(0)]
FROM `my-poc` t
GROUP BY id
But I saw that this query will become very expensive over time, because it has to scan all rows to find the latest data.
Is there another solution that lets me keep historical data?
References:
https://towardsdatascience.com/slowly-changing-dimension-type-2-with-google-bigquery-749e0f9107fb
https://medium.com/codex/collaborate-better-with-data-versioning-566c2299c435
By accessing the historical data, did you mean time travel? BigQuery has a feature called time travel [1]: by specifying a time travel window, you can access the data as it was at any point in time within that window.
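For example, something like this (using the table name from your query) reads the table as it was one hour ago; note that the time travel window can only be configured between two and seven days:

SELECT *
FROM `my-poc`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);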
@nceniza Yes, it's like time travel, but I need it for a long period, and from the link you shared I saw that I would need to use snapshots for that.
When streaming from PubSub to BigQuery, how can I update a row instead of appending one more row?