Straggler Worker and Job Hang in Dataflow Pipeline Using beam.CombineGlobally

I have a Dataflow pipeline (using Apache Beam) that reads about 240 chunk files from GCS, totaling ~6.82 GB. My goal is to merge these chunks into one file (or a small number of ~3 GB files) using beam.CombineGlobally, but the pipeline keeps hanging during the global combine stage, even though the dataset is only moderately sized.

When testing with a smaller number of chunks, everything works fine. I’m concerned that this might be related to Dataflow memory constraints, the shuffle phase, or some other bottleneck. How can I address these issues to make beam.CombineGlobally work reliably at this scale?
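
For reference, the pipeline is shaped roughly like the sketch below. The GCS paths and the line-concatenating merge function are simplified placeholders, not my actual code:

```python
import apache_beam as beam

def merge_chunks(parts):
    # CombineGlobally calls this on input elements and on partial results
    # alike, so the operation must be associative; concatenation is.
    return "\n".join(parts)

with beam.Pipeline() as p:
    (
        p
        | "ReadChunks" >> beam.io.ReadFromText("gs://my-bucket/chunks/*")    # placeholder path
        | "MergeAll" >> beam.CombineGlobally(merge_chunks)                   # funnels to 1 element, ~6.8 GB
        | "WriteMerged" >> beam.io.WriteToText("gs://my-bucket/merged/out")  # placeholder path
    )
```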

Details and progress:

  • Initial inputs: ~6,636,960 elements, ~6.85 GB

  • Outputs after first combine: 501 elements, ~6.82 GB

  • Outputs after middle combines: 250 elements → 146 elements, still ~6.82 GB

  • Global combine: 146 elements → 1 element, ~6.82 GB total

  • Final save: 1 element, ~6.82 GB

Despite trying various approaches, I haven’t found a clear online reference for handling large-scale data with beam.CombineGlobally. Any suggestions or best practices would be greatly appreciated.
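
Would something like CombineGlobally.with_fanout help here? As I understand it, fanout inserts intermediate combine stages (similar to the 501 → 250 → 146 staging above) so that no single worker has to merge all the partial results at once, though the final merge still produces a single element. A sketch of what I mean, with an illustrative (untuned) fanout value:

```python
import apache_beam as beam

def merge_chunks(parts):
    # Same associative concatenation as in the sketch above.
    return "\n".join(parts)

with beam.Pipeline() as p:
    (
        p
        | "ReadChunks" >> beam.io.ReadFromText("gs://my-bucket/chunks/*")    # placeholder path
        # with_fanout(16) splits the global combine into intermediate shards
        # before the final merge; 16 is an illustrative value, not a tuned one.
        | "MergeAll" >> beam.CombineGlobally(merge_chunks).with_fanout(16)
        | "WriteMerged" >> beam.io.WriteToText("gs://my-bucket/merged/out")  # placeholder path
    )
```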



@ms4446
