I have a Dataflow pipeline (using Apache Beam) that reads about 240 chunk files from GCS, totaling ~6.82 GB. My goal is to merge these chunks into a single file, or a small number of ~3 GB files, using beam.CombineGlobally, but the pipeline keeps hanging at the global combine stage, even though the dataset is only moderately sized.
Everything works fine when I test with a smaller number of chunks. I suspect the hang is related to Dataflow worker memory constraints, the shuffle phase, or some other bottleneck. How can I address these issues so that beam.CombineGlobally works reliably at this scale?
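For context, here is a simplified sketch of the pipeline structure (the bucket paths and the concatenating CombineFn are placeholders, not my exact code):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ConcatFn(beam.CombineFn):
    """Placeholder CombineFn that concatenates text elements into one blob."""

    def create_accumulator(self):
        return []

    def add_input(self, accumulator, element):
        accumulator.append(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        merged = []
        for accumulator in accumulators:
            merged.extend(accumulator)
        return merged

    def extract_output(self, accumulator):
        return "\n".join(accumulator)


with beam.Pipeline(options=PipelineOptions()) as p:  # Dataflow runner options omitted
    (
        p
        | "ReadChunks" >> beam.io.ReadFromText("gs://my-bucket/chunks/*")  # ~240 files
        | "MergeAll" >> beam.CombineGlobally(ConcatFn())  # this is where it hangs
        | "WriteMerged" >> beam.io.WriteToText("gs://my-bucket/merged/output", num_shards=1)
    )
```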
Details and progress:
- Initial inputs: ~6,636,960 elements, ~6.85 GB
- Outputs after first combine: 501 elements, ~6.82 GB
- Outputs after middle combines: 250 elements → 146 elements, still ~6.82 GB
- Global combine: 146 elements → 1 element, ~6.82 GB total
- Final save: 1 element, ~6.82 GB
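The intermediate combine stages above come from splitting the global merge into steps; a minimal sketch of that kind of staging using with_fanout (the fanout value is illustrative, and my actual staging may differ):

```python
import apache_beam as beam

# Reuses the placeholder ConcatFn from the sketch above; the fanout value is illustrative.
chunks = p | "ReadChunks" >> beam.io.ReadFromText("gs://my-bucket/chunks/*")
merged = chunks | "MergeWithFanout" >> beam.CombineGlobally(ConcatFn()).with_fanout(16)
```

The fanout is meant to keep any single worker from having to accumulate the full ~6.8 GB before the final merge, but the last stage still collapses everything onto one worker.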
Despite trying various approaches, I haven’t found a clear online reference for handling large-scale data with beam.CombineGlobally. Any suggestions or best practices would be greatly appreciated.