Incremental tables - Page 2

janbreyta · 02-09-2024 11:22 AM

Hello. I have some questions when it comes to how to manage incremental tables. I have read several blog posts, but no-one goes into detail on what race conditions may occur. I am hoping someone here can shed some light on the topic.

What I want is to process only the new data in a table since last time I ran a job. The first obvious solution was to add a column with default value current_timestamp. But as I understand it, that timestamp is set to the query time. So if I run jobs simultaneously one job could start at say 10:00:00, add rows to the table and finish up at 10:00:20. If I run a job 10:00:10 to process everything that is new, and store the greatest timestamp, that might very well be above 10:00:00. And then after I think I have processed everything up until 10:00:10, these new entries are added to the table.

I then came across the feature Change history. That looks great. It says This feature lets you process incremental changes made to a table. But I am foggy on the details.

I can select from APPENDS. I can supply a start and end timestamp. So then I thought I can add a max timestamp, and BigQuery will crash if it is over the BQ current timestamp, not so I am afraid. I can query ahead in time, so then I need to manage that part myself.

Then I assume I need to store the greatest timestamp I have seen so far to be used in the next iteration. But is there any chance that I read and see the greatest timestamp is 10:00:10, but there will actually be added more entries later to that table before 10:00:10, or could they be added exactly at 10:00:10? Or is there some guarantee that whatever timestamp you read, you can be sure that older entries than that will never be added?

If this is a good use case I would also like to understand how the _CHANGE_TIMESTAMP if generated. I am wondering if there is a way to avoid reading all the results and get the max last timestamp. Can I for example first select current_timestamp, and use that in the max. Are these solutions synced up, or can there be some diff between standard current_timestamp and the timestamp generated and put in _CHANGE_TIMESTAMP.

I would love to learn what kind of guarantees I can expect from the Change history feature or if you have any other tips on how to implement an incremental table handling, where I can be sure to only include rows once, and not miss out on any rows.