
Straggler Worker and Job Hang in Dataflow Pipeline Using beam.CombineGlobally

I have a Dataflow pipeline (using Apache Beam) that reads about 240 chunk files from GCS totaling ~6.82 GB. My goal is to merge these chunks into one (or a small number of) ~3 GB file(s) by using beam.CombineGlobally, but the pipeline keeps hanging during the global combine stage—even though the dataset size is only moderate.

When testing with a smaller number of chunks, everything works fine. I’m concerned that this might be related to Dataflow memory constraints, the shuffle phase, or some other bottleneck. How can I address these issues to make beam.CombineGlobally work reliably at this scale?

Details and progress:

  • Initial inputs: ~6,636,960 elements, ~6.85 GB

  • Outputs after first combine: 501 elements, ~6.82 GB

  • Outputs after middle combines: 250 elements → 146 elements, still ~6.82 GB

  • Global combine: 146 elements → 1 element, ~6.82 GB total

  • Final save: 1 element, ~6.82 GB

Despite trying various approaches, I haven’t found a clear online reference for handling large-scale data with beam.CombineGlobally. Any suggestions or best practices would be greatly appreciated.




@ms4446

 

Hi @lambs

Welcome to the Google Cloud Community!

From what you describe, the global combine stage is likely bottlenecked by the shuffle phase and by per-worker memory: beam.CombineGlobally ultimately funnels all partial results to a single worker, so the final ~6.82 GB merge must fit in one worker's memory. Here are some suggestions and best practices to help you address these issues:

  1. Adjust Dataflow Worker Settings - Run workers with more memory, for example a high-memory machine type via the `--machine_type` pipeline option, and enable autoscaling (`--autoscaling_algorithm=THROUGHPUT_BASED` with a suitable `--max_num_workers`) so Dataflow can add workers during shuffle-heavy stages.
  2. Optimize Your Combiner Function - Make sure the combiner doesn't duplicate data or buffer large amounts of it in memory, and keep accumulators small and associative. Add fanout with `beam.CombineGlobally(fn).with_fanout(n)` so Beam runs a tree of partial combines across workers instead of sending every element to the single worker holding the global result.
  3. Dataflow Monitoring - Monitor CPU usage, memory usage, and shuffle performance. Look for bottlenecks and areas where your pipeline is spending the most time.

For more information about Dataflow pipeline best practices, you can read this documentation.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.