Hi,
I have implemented a Dataflow pipeline to replicate data from BigQuery to Cloud SQL. In BigQuery there are tables with billions of rows, so I read data from a BigQuery table in a paginated way and emit each page to the next stage/transformation (a new DoFn that writes the data in parallel).
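Roughly, the pipeline shape is as follows (a simplified sketch only; the transform names, table names, and DoFn bodies are placeholders, not my real code):

```java
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class ReplicationPipelineSketch {
  public static void main(String[] args) {
    // Placeholder table list; in the real pipeline this is discovered dynamically.
    List<String> tables = Arrays.asList("dataset.table_a", "dataset.table_b");

    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ListTables", Create.of(tables))
        // Stage 1: read each table page by page and emit one element per page/batch.
        .apply("ReadPages", ParDo.of(new DoFn<String, KV<String, String>>() {
          @ProcessElement
          public void process(@Element String table, OutputReceiver<KV<String, String>> out) {
            // Placeholder for the paginated BigQuery read; each page is emitted as soon as it is read.
            out.output(KV.of(table, "page-payload"));
          }
        }))
        // Stage 2: write each emitted batch to Cloud SQL.
        .apply("WriteBatch", ParDo.of(new DoFn<KV<String, String>, Void>() {
          @ProcessElement
          public void process(@Element KV<String, String> batch) {
            // Placeholder for the Cloud SQL write of one batch.
          }
        }));

    p.run();
  }
}
```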
Everything was working fine: reading the BQ table data dynamically and writing it dynamically to Cloud SQL. The problem starts with status tracking for all the pages/batches of a large BigQuery table that we continuously emit for writing, so that we know whether every batch/page of the table we read has actually been written by the next stage/transformation (the write DoFn).
I implemented a stateful function in the pipeline to track all the batches of a table, so that I can notify the system once data replication for the table has completed, but that is where the problem starts.
In the read stage, we previously read the BQ table in batches and emitted those batches to the next stage/transformation for writing. With stateful processing, however, nothing reaches the next stage until all pages have been read and the corresponding batches created. In fact the batches are emitted, but the next stage never receives them, and the worker goes out of memory, since large BQ tables can be 10 GB to 100 GB in size.
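For context, the tracking pattern I mean is essentially per-key state, along these lines (a simplified sketch, not my actual code; names are made up):

```java
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Counts how many batches of a table have passed through, keyed by table name.
// Beam partitions state per key (and window), so every batch of one table is
// routed to the same key -- and therefore processed by one worker -- in this stage.
class TrackBatchesFn extends DoFn<KV<String, String>, KV<String, String>> {

  @StateId("batchCount")
  private final StateSpec<ValueState<Long>> batchCountSpec =
      StateSpecs.value(VarLongCoder.of());

  @ProcessElement
  public void process(
      @Element KV<String, String> batch,
      @StateId("batchCount") ValueState<Long> batchCount,
      OutputReceiver<KV<String, String>> out) {
    Long seenSoFar = batchCount.read();
    batchCount.write((seenSoFar == null ? 0L : seenSoFar) + 1);
    out.output(batch); // pass the batch through to the write stage
  }
}
```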
Is this how stateful processing in Beam/Dataflow is supposed to work? Can someone please suggest other possible/better ways of maintaining this status at the Dataflow job level so that it is shared among workers?
Your understanding of side inputs in Apache Beam is correct. They are indeed immutable and read-only, loaded by each worker before the DoFn processes elements. This makes them suitable for providing additional input data, but not for tracking mutable state or progress across a dynamic, streaming pipeline.
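For example (a minimal sketch, assuming a hypothetical `tableConfigs` PCollection of per-table settings and `batches` as the PCollection being written), the side input is materialized once as a view and can only be read inside the DoFn:

```java
// The view is computed once; workers look values up but can never update them.
PCollectionView<Map<String, String>> tableConfigView =
    tableConfigs.apply("ConfigAsMap", View.asMap());

batches.apply("WriteWithConfig",
    ParDo.of(new DoFn<KV<String, String>, Void>() {
          @ProcessElement
          public void process(ProcessContext c) {
            Map<String, String> config = c.sideInput(tableConfigView); // read-only lookup
            String settings = config.get(c.element().getKey());
            // ... use settings to drive the Cloud SQL write for this batch ...
          }
        })
        .withSideInputs(tableConfigView));
```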
Given your requirement to track the progress of batch writes for multiple tables in a streaming context, here are some suggestions:
1. External State Management (see the sketch after this list)
2. Data-Driven Triggers in Beam
3. Stateful DoFn with External Checkpoints
4. Monitoring and Aggregation
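For the first option, here is a minimal sketch of the idea, assuming a hypothetical `replication_status` table in Cloud SQL with columns (`table_name`, `batches_expected`, `batches_written`): the read stage records how many batches it produced per table, and a small DoFn after the write stage marks each batch as written, so the mutable progress lives outside the Dataflow workers.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

/** Marks one successfully written batch in the external status table (all names hypothetical). */
class MarkBatchWrittenFn extends DoFn<KV<String, Long>, Void> {
  private transient Connection conn;

  @Setup
  public void setup() throws SQLException {
    // Placeholder connection string and credentials; in practice use the Cloud SQL JDBC socket factory.
    conn = DriverManager.getConnection(
        "jdbc:postgresql://127.0.0.1:5432/replication", "user", "password");
  }

  @ProcessElement
  public void process(@Element KV<String, Long> writtenBatch) throws SQLException {
    // Key = table name, value = batch/page number (hypothetical element shape).
    try (PreparedStatement ps = conn.prepareStatement(
        "UPDATE replication_status SET batches_written = batches_written + 1 "
            + "WHERE table_name = ?")) {
      ps.setString(1, writtenBatch.getKey());
      ps.executeUpdate();
    }
  }

  @Teardown
  public void teardown() throws SQLException {
    if (conn != null) {
      conn.close();
    }
  }
}
```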
For your use case, a combination of external state management (using a database or Cloud Storage) and custom triggers or monitoring is likely the most effective approach. This setup lets you offload the mutable state from the Dataflow workers, avoiding the memory issues associated with large stateful operations in Beam, while still tracking the progress of your batches dynamically.
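With such a status table in place, "replication complete" for a table becomes a simple query (same hypothetical schema as above), which a final pipeline step, a Cloud Function, or a small monitoring job can poll before sending the notification:

```java
// Completion check against the hypothetical replication_status table.
boolean isTableReplicated(Connection conn, String tableName) throws SQLException {
  try (PreparedStatement ps = conn.prepareStatement(
      "SELECT batches_written >= batches_expected FROM replication_status "
          + "WHERE table_name = ?")) {
    ps.setString(1, tableName);
    try (java.sql.ResultSet rs = ps.executeQuery()) {
      return rs.next() && rs.getBoolean(1);
    }
  }
}
```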