I have a Dataflow pipeline (using Apache Beam) that reads about 240 chunk files from GCS totaling ~6.82 GB. My goal is to merge these chunks into one (or a small number of) ~3 GB file(s) by using beam.CombineGlobally, but the pipeline keeps hanging during the global combine stage, even though the dataset size is only moderate.
When testing with a smaller number of chunks, everything works fine. I'm concerned that this might be related to Dataflow memory constraints, the shuffle phase, or some other bottleneck. How can I address these issues to make beam.CombineGlobally work reliably at this scale?
Details and progress:
Initial inputs: ~6,636,960 elements, ~6.85 GB
Outputs after first combine: 501 elements, ~6.82 GB
Outputs after middle combines: 250 elements → 146 elements, still ~6.82 GB
Global combine: 146 elements → 1 element, ~6.82 GB total
Final save: 1 element, ~6.82 GB
Despite trying various approaches, I haven't found a clear online reference for handling large-scale data with beam.CombineGlobally. Any suggestions or best practices would be greatly appreciated.
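For context, this is a minimal sketch of the pattern I'm using (the bucket paths, parse step, and merge function are simplified placeholders, and I've omitted the staged intermediate combines):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_record(line):
    # Placeholder for my real parsing logic.
    return line

def concat_records(records):
    # Placeholder merge step: builds the single large output value,
    # which is where the ~6.8 GB blob ends up.
    return "\n".join(records)

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "ReadChunks" >> beam.io.ReadFromText("gs://my-bucket/chunks/*")  # ~240 files
        | "Parse" >> beam.Map(parse_record)
        | "MergeAll" >> beam.CombineGlobally(concat_records)  # hangs here at full scale
        | "WriteMerged" >> beam.io.WriteToText("gs://my-bucket/merged/out")
    )
```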
Hi @lambs,
Welcome to the Google Cloud Community!
It seems like you're running into the shuffle phase and per-worker memory constraints during the global combine stage of your Apache Beam pipeline. Here are some suggestions and best practices to help you address these issues:

- beam.CombineGlobally ultimately merges everything into a single value on a single worker, so that worker has to build and hold the full ~6.8 GB result in memory. This is a common reason the combine stage stalls even when the overall dataset is moderate.
- If the goal is simply one or a few large output files, let the sink shard the output (for example, num_shards on the write transform) instead of materializing the whole dataset as one combined element (see the sketch below).
- If you genuinely need a single combined value, CombineGlobally(...).with_fanout(...) spreads the intermediate combining across workers, and a larger-memory worker machine type gives the final merge more headroom.
- If the chunks are plain files that only need to be concatenated, composing the objects directly in Cloud Storage (outside the pipeline) avoids shuffling the bytes through Dataflow at all.
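Here is a rough sketch of the first two options, assuming a hypothetical concat_records merge function and placeholder bucket paths (replace these, and the fanout value, with your own):

```python
import apache_beam as beam

def concat_records(records):
    # Hypothetical stand-in for your merge logic.
    return "\n".join(records)

with beam.Pipeline() as p:
    records = p | "ReadChunks" >> beam.io.ReadFromText("gs://your-bucket/chunks/*")

    # Option A: skip the single-element global combine and let the sink
    # write a small, fixed number of large output files.
    _ = (
        records
        | "WriteSharded" >> beam.io.WriteToText(
            "gs://your-bucket/merged/part", num_shards=3)
    )

    # Option B: if one combined value is truly required, fan out the
    # intermediate combining so it is not funneled through a single
    # worker until the final merge step.
    _ = (
        records
        | "MergeAll" >> beam.CombineGlobally(concat_records).with_fanout(16)
        | "WriteSingle" >> beam.io.WriteToText(
            "gs://your-bucket/merged/single", num_shards=1)
    )
```

Note that even with fanout, the final combined value still has to fit in one worker's memory, so Option A (or composing the objects in Cloud Storage) is usually the more reliable path when the goal is just fewer, larger files.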
For more information, see the Dataflow pipeline best practices documentation.
Was this helpful? If so, please accept this answer as "Solution". If you need additional assistance, reply here within 2 business days and I'll be happy to help.