
Straggler Worker and Job Hang in Dataflow Pipeline Using beam.CombineGlobally

I have a Dataflow pipeline (using Apache Beam) that reads about 240 chunk files from GCS totaling ~6.82 GB. My goal is to merge these chunks into one (or a small number of) ~3 GB file(s) by using beam.CombineGlobally, but the pipeline keeps hanging during the global combine stage, even though the dataset size is only moderate.

When testing with a smaller number of chunks, everything works fine. I'm concerned that this might be related to Dataflow memory constraints, the shuffle phase, or some other bottleneck. How can I address these issues to make beam.CombineGlobally work reliably at this scale?

Details and progress:

  • Initial inputs: ~6,636,960 elements, ~6.85 GB

  • Outputs after first combine: 501 elements, ~6.82 GB

  • Outputs after middle combines: 250 elements → 146 elements, still ~6.82 GB

  • Global combine: 146 elements → 1 element, ~6.82 GB total

  • Final save: 1 element, ~6.82 GB

Despite trying various approaches, I haven't found a clear online reference for handling large-scale data with beam.CombineGlobally. Any suggestions or best practices would be greatly appreciated.




MJane
Former Googler

Hi @lambs

Welcome to the Google Cloud Community!

It seems like you're having trouble with the shuffle phase and memory constraints during the global combine stage of your Apache Beam pipeline. Here are some suggestions and best practices to help you address these issues:

  1. Adjust Dataflow Worker Settings - Configure your Dataflow workers with a higher-memory machine type so each worker can hold larger combine accumulators. Enable autoscaling so Dataflow can allocate more workers during the shuffle phase and spread the load.
  2. Optimize Your Combiner Function - Make sure the combiner doesn't duplicate data or accumulate the entire dataset in memory at once. Combine incrementally in smaller chunks so no single worker has to hold all ~6.8 GB to produce the final output.
  3. Dataflow Monitoring - Monitor CPU usage, memory usage, and shuffle performance in the Dataflow monitoring UI. Look for the stage where your pipeline spends the most time or stalls.

For more information about Dataflow pipeline best practices, you can read this documentation.

Was this helpful? If so, please accept this answer as โ€œSolutionโ€. If you need additional assistance, reply here within 2 business days and Iโ€™ll be happy to help.