
Windowing issue when reading a large BigQuery table

Hi,
This post is a continuation of https://www.googlecloudcommunity.com/gc/Data-Analytics/Dataflow-Stateful-processing-Issue-worker-fai... . In my Dataflow pipeline I need to read data from BigQuery using a custom DoFn and write it to Cloud SQL. Some of the tables are large, with several million rows. For these heavy tables I had trouble tracking the status of the written batches, so, as suggested in the previous post, I used an external system/storage to maintain the state of these batches and sent an event with every batch commit/write to Cloud SQL.

But this process was not optimal, since it produces a large number of events in the external system. To reduce them, I added windowing (say, 2-minute windows) and tried to send the events after grouping, but the previous issue started recurring. Data is not captured in a window until the previous stage has committed all of its data (it waits for all the data to be read from the table and written to Cloud SQL), which is large and takes time. I am using the BigQuery client to read from BigQuery in batches. I did some research and found that a Splittable DoFn should help here, but I could find only limited resources on it.
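
The idea behind a splittable read can be sketched in plain Python. This is not Beam code and not the pipeline described above: `split_range`, `read_range`, `fake_fetch_rows`, the 10-row table, and the chunk and batch sizes are all hypothetical stand-ins. It only illustrates the intent: the row range is split into independent sub-ranges, and each sub-range emits its batches as soon as they are fetched, so later stages do not have to wait for the whole table.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Row = Dict[str, int]


def split_range(start: int, stop: int, chunk: int) -> List[Tuple[int, int]]:
    """Split [start, stop) into consecutive half-open sub-ranges of at most `chunk` rows."""
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    return [(lo, min(lo + chunk, stop)) for lo in range(start, stop, chunk)]


def read_range(
    fetch_rows: Callable[[int, int], List[Row]],
    lo: int,
    hi: int,
    batch_size: int,
) -> Iterator[List[Row]]:
    """Yield batches for the sub-range [lo, hi) as soon as each batch is fetched."""
    offset = lo
    while offset < hi:
        limit = min(batch_size, hi - offset)
        batch = fetch_rows(offset, limit)
        if not batch:
            break  # the source has fewer rows than expected
        yield batch
        offset += len(batch)


def fake_fetch_rows(offset: int, limit: int) -> List[Row]:
    """Hypothetical stand-in for a paged BigQuery read over a 10-row table."""
    table = [{"id": i} for i in range(10)]
    return table[offset:offset + limit]


if __name__ == "__main__":
    for lo, hi in split_range(0, 10, 4):
        for batch in read_range(fake_fetch_rows, lo, hi, batch_size=3):
            print((lo, hi), [row["id"] for row in batch])
```

In Beam terms, each `(lo, hi)` pair would roughly correspond to a restriction (for example an offset range) that the runner can split and process independently. How that maps onto a BigQuery client read is the part I am unsure about.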

Can someone please point me to reference documents or articles on reading data from BigQuery using a Splittable DoFn?

@ms4446 @mohitshagcp 

 
