Hi,
This post is a continuation of https://www.googlecloudcommunity.com/gc/Data-Analytics/Dataflow-Stateful-processing-Issue-worker-fai... . In my Dataflow pipeline I need to read data from BigQuery using a custom DoFn and write it to Cloud SQL. I found that some of the tables are large, with several million rows. For these heavy tables I was struggling to track the status of written batches, so, as suggested in the previous post, I used an external system/storage to maintain the state of those batches and sent an event with every batch commit/write to Cloud SQL.
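To make the pattern concrete, here is a rough sketch of what I mean by the write step: a DoFn that buffers rows, flushes each batch to Cloud SQL over JDBC, and emits one commit event per batch. The JDBC URL, table name, and event format below are simplified placeholders, not my actual code:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
import org.joda.time.Instant;

// Sketch: buffer incoming rows, flush each batch to Cloud SQL over JDBC,
// and emit one commit event per flushed batch so an external system can
// track which batches have been written.
class WriteBatchAndEmitEventFn extends DoFn<String, String> {

  private static final int BATCH_SIZE = 500;
  private transient Connection connection;
  private transient List<String> buffer;

  @Setup
  public void setup() throws Exception {
    // Placeholder connection; in practice this would use the Cloud SQL
    // JDBC socket factory and real credentials.
    connection = DriverManager.getConnection(
        "jdbc:postgresql://127.0.0.1:5432/mydb", "user", "password");
  }

  @StartBundle
  public void startBundle() {
    buffer = new ArrayList<>();
  }

  @ProcessElement
  public void processElement(@Element String row, OutputReceiver<String> out) throws Exception {
    buffer.add(row);
    if (buffer.size() >= BATCH_SIZE) {
      out.output(flush());
    }
  }

  @FinishBundle
  public void finishBundle(FinishBundleContext ctx) throws Exception {
    if (!buffer.isEmpty()) {
      // Outputs from @FinishBundle need an explicit timestamp and window.
      ctx.output(flush(), Instant.now(), GlobalWindow.INSTANCE);
    }
  }

  @Teardown
  public void teardown() throws Exception {
    if (connection != null) {
      connection.close();
    }
  }

  // Writes the buffered rows in one JDBC batch and returns a commit event.
  private String flush() throws Exception {
    try (PreparedStatement stmt =
        connection.prepareStatement("INSERT INTO my_table (payload) VALUES (?)")) {
      for (String row : buffer) {
        stmt.setString(1, row);
        stmt.addBatch();
      }
      stmt.executeBatch();
    }
    String event = "batch=" + UUID.randomUUID() + ",rows=" + buffer.size();
    buffer.clear();
    return event;
  }
}
```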
But this approach was not optimal, because it produces a lot of events in the external system. To optimize it, I implemented windowing (let's say 2-minute windows) and tried to send events after grouping, as sketched below, but then the previous issue started re-occurring: no data is captured in the window until the previous stage commits, i.e. it waits for all the data to be read from the table and written to Cloud SQL, which is large and will take time. I am using the BigQuery client to read from BigQuery in batches. From my research it seems a Splittable DoFn would help here, but I have found only limited resources on it.
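The windowing I tried is roughly the following; the event type and the aggregation (a simple count) are simplified stand-ins:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowedCommitEvents {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stand-in for the per-batch commit events produced by the write step.
    PCollection<String> commitEvents = p.apply(Create.of("batch-1", "batch-2"));

    // Group events into 2-minute fixed windows and emit one count per
    // window, instead of one call to the external system per batch.
    commitEvents
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(2))))
        .apply(Count.<String>globally().withoutDefaults());

    p.run().waitUntilFinish();
  }
}
```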
Can someone please point me to reference documents/articles on reading data from BigQuery using a Splittable DoFn?
Here are some resources and considerations for using Splittable DoFns, especially in the context of reading data from BigQuery:
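For illustration, a minimal Splittable DoFn for this could look roughly like the sketch below. It treats the table's row count as a [0, rowCount) offset restriction that the runner can split and checkpoint, and reads each claimed range via the BigQuery client's tabledata.list API. The table-spec format, split size, and choice of tabledata.list are illustrative assumptions, not a production reader:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableResult;
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;

// Sketch of a Splittable DoFn that reads a BigQuery table in row-offset
// ranges. The restriction is [0, rowCount); the runner may split it so
// ranges are read in parallel and checkpointed independently.
class ReadBigQueryTableFn extends DoFn<String /* "project.dataset.table" */, FieldValueList> {

  private static final long ROWS_PER_SPLIT = 100_000L;
  private transient BigQuery bigquery;

  @Setup
  public void setup() {
    bigquery = BigQueryOptions.getDefaultInstance().getService();
  }

  // Initial restriction: the full row range of the table.
  @GetInitialRestriction
  public OffsetRange getInitialRestriction(@Element String tableSpec) {
    BigQuery bq = BigQueryOptions.getDefaultInstance().getService();
    long rowCount = bq.getTable(toTableId(tableSpec)).getNumRows().longValue();
    return new OffsetRange(0, rowCount);
  }

  // Pre-split into fixed-size chunks so work can be distributed upfront.
  @SplitRestriction
  public void splitRestriction(
      @Restriction OffsetRange restriction, OutputReceiver<OffsetRange> out) {
    for (OffsetRange range : restriction.split(ROWS_PER_SPLIT, ROWS_PER_SPLIT)) {
      out.output(range);
    }
  }

  @ProcessElement
  public void processElement(
      @Element String tableSpec,
      RestrictionTracker<OffsetRange, Long> tracker,
      OutputReceiver<FieldValueList> out) {
    long position = tracker.currentRestriction().getFrom();
    TableResult result =
        bigquery.listTableData(
            toTableId(tableSpec),
            BigQuery.TableDataListOption.startIndex(position));
    for (FieldValueList row : result.iterateAll()) {
      // tryClaim fails once we reach the end of this restriction (or the
      // runner has split it), at which point we simply stop reading.
      if (!tracker.tryClaim(position)) {
        return;
      }
      out.output(row);
      position++;
    }
  }

  private static TableId toTableId(String tableSpec) {
    String[] parts = tableSpec.split("\\.");
    return TableId.of(parts[0], parts[1], parts[2]);
  }
}
```

Note that in a real pipeline the output element would need a registered coder, so converting each FieldValueList to a serializable type (e.g. a TableRow or a POJO) is usually necessary. The Beam programming guide covers the Splittable DoFn lifecycle and APIs: https://beam.apache.org/documentation/programming-guide/#splittable-dofns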
Thank you so much @ms4446 for your inputs!