I have a Dataflow pipeline (using Apache Beam) that reads about 240 chunk files from GCS, totaling ~6.82 GB. My goal is to merge these chunks into one (or a small number of) ~3 GB file(s) using beam.CombineGlobally, but the pipeline keeps hanging during the global combine stage, even though the dataset is only moderately sized.
When testing with a smaller number of chunks, everything works fine. I’m concerned that this might be related to Dataflow memory constraints, the shuffle phase, or some other bottleneck. How can I address these issues to make beam.CombineGlobally work reliably at this scale?
Details and progress:
Initial inputs: ~6,636,960 elements, ~6.85 GB
Outputs after first combine: 501 elements, ~6.82 GB
Outputs after intermediate combines: 250 elements → 146 elements, still ~6.82 GB
Global combine: 146 elements → 1 element, ~6.82 GB total
Final save: 1 element, ~6.82 GB
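For reference, the pipeline is essentially the following. This is a simplified sketch: the bucket paths and MergeChunksFn are placeholders for my actual code, but the structure matches the stages listed above.

```python
import apache_beam as beam


class MergeChunksFn(beam.CombineFn):
    """Placeholder for my real merge logic: concatenates chunk contents."""

    def create_accumulator(self):
        return []

    def add_input(self, accumulator, element):
        accumulator.append(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        merged = []
        for acc in accumulators:
            merged.extend(acc)
        return merged

    def extract_output(self, accumulator):
        return "\n".join(accumulator)


with beam.Pipeline() as p:
    (
        p
        | "ReadChunks" >> beam.io.ReadFromText("gs://my-bucket/chunks/*")  # ~240 files, ~6.82 GB
        | "MergeAll" >> beam.CombineGlobally(MergeChunksFn())              # hangs here at full scale
        | "WriteMerged" >> beam.io.WriteToText("gs://my-bucket/merged")    # placeholder output path
    )
```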
Despite trying various approaches, I haven’t found a clear online reference for handling large-scale data with beam.CombineGlobally. Any suggestions or best practices would be greatly appreciated.
Hi @lambs,
Welcome to the Google Cloud Community!
It sounds like the global combine stage is hitting shuffle and worker-memory limits: beam.CombineGlobally collapses the entire PCollection into a single element, so the final merge runs on one worker and the whole ~6.82 GB result has to fit in that worker's memory. Here are some suggestions and best practices to help you address this:
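For example, assuming you're on the Python SDK, a common way to relieve pressure on that final step is CombineGlobally's with_fanout option, which pre-combines elements into several partial results in parallel before the last merge. Here's a minimal sketch; the paths and the lambda combiner are placeholders for your own merge logic:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "ReadChunks" >> beam.io.ReadFromText("gs://your-bucket/chunks/*")  # placeholder path
        # with_fanout(16) adds an intermediate pre-combine stage: elements are
        # merged into 16 partial results in parallel across workers, and only
        # those partial results reach the final single-key combine step.
        | "MergeAll" >> beam.CombineGlobally(lambda lines: "\n".join(lines)).with_fanout(16)
        | "WriteMerged" >> beam.io.WriteToText("gs://your-bucket/merged")    # placeholder path
    )
```

Note that even with fanout, the final output is still a single element that must fit in one worker's memory, so you may also want to run on a high-memory machine type, or skip the global combine entirely and write a small fixed number of output files via the num_shards argument of WriteToText.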
For more information, see the Dataflow documentation on pipeline best practices.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.