I have a Dataflow pipeline (using Apache Beam) that reads about 240 chunk files from GCS, totaling ~6.82 GB. My goal is to merge these chunks into a single file, or a small number of ~3 GB files, using beam.CombineGlobally, but the pipeline keeps hanging at the global combine stage, even though the dataset is only moderately sized.
Everything works fine when I test with a smaller number of chunks. I suspect the hang is related to Dataflow worker memory constraints, the shuffle phase, or some other bottleneck. How can I address these issues so that beam.CombineGlobally works reliably at this scale?
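For context, here is a simplified sketch of the pipeline structure (the bucket paths and the concatenating CombineFn are placeholders, not my exact code):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ConcatFn(beam.CombineFn):
    """Placeholder CombineFn that concatenates text elements into one blob."""

    def create_accumulator(self):
        return []

    def add_input(self, accumulator, element):
        accumulator.append(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        merged = []
        for accumulator in accumulators:
            merged.extend(accumulator)
        return merged

    def extract_output(self, accumulator):
        return "\n".join(accumulator)


with beam.Pipeline(options=PipelineOptions()) as p:  # Dataflow runner options omitted
    (
        p
        | "ReadChunks" >> beam.io.ReadFromText("gs://my-bucket/chunks/*")  # ~240 files
        | "MergeAll" >> beam.CombineGlobally(ConcatFn())  # this is where it hangs
        | "WriteMerged" >> beam.io.WriteToText("gs://my-bucket/merged/output", num_shards=1)
    )
```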
Details and progress:
- Initial inputs: ~6,636,960 elements, ~6.85 GB
- Outputs after first combine: 501 elements, ~6.82 GB
- Outputs after middle combines: 250 elements → 146 elements, still ~6.82 GB
- Global combine: 146 elements → 1 element, ~6.82 GB total
- Final save: 1 element, ~6.82 GB
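The intermediate combine stages above come from splitting the global merge into steps; a minimal sketch of that kind of staging using with_fanout (the fanout value is illustrative, and my actual staging may differ):

```python
import apache_beam as beam

# Reuses the placeholder ConcatFn from the sketch above; the fanout value is illustrative.
chunks = p | "ReadChunks" >> beam.io.ReadFromText("gs://my-bucket/chunks/*")
merged = chunks | "MergeWithFanout" >> beam.CombineGlobally(ConcatFn()).with_fanout(16)
```

The fanout is meant to keep any single worker from having to accumulate the full ~6.8 GB before the final merge, but the last stage still collapses everything onto one worker.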
Despite trying various approaches, I haven’t found a clear online reference for handling large-scale data with beam.CombineGlobally. Any suggestions or best practices would be greatly appreciated.