
DATAFLOW: Job failed with the Shuffle feature enabled

Hello everyone!

I have this DAG in Dataflow. The step where it fails is "group main_title", which is just a beam.GroupByKey().

As I understand it, when doing GroupByKey operations the runner enables the Shuffle service, which means the data is moved out of the worker VMs and sent to the Dataflow backend to perform the shuffle. Why is it failing?
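
For context, the failing step is essentially the sketch below. The input records and the key field are placeholders here, not my real source, but the shape of the pipeline is the same:

    import apache_beam as beam

    # Minimal sketch of the failing stage: key each record by main_title, then group.
    # The Create() input is placeholder data, not the real source.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "create" >> beam.Create([
                {"main_title": "a", "value": 1},
                {"main_title": "a", "value": 2},
                {"main_title": "b", "value": 3},
            ])
            | "key by main_title" >> beam.Map(lambda row: (row["main_title"], row))
            | "group main_title" >> beam.GroupByKey()
            | "print" >> beam.Map(print)
        )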

Job info

Job type Batch
Job status Failed
SDK version Apache Beam Python 3.8 SDK 2.37.0
Job region us-central1
Worker location us-central1
Current workers 0
Latest worker status Worker pool stopped.
Start time May 4, 2023 at 12:40:06 PM GMT-5
Elapsed time 1 hr 7 min
Encryption typeGoogle-managed key
Dataflow Prime Disabled
Runner v2 Enabled
Dataflow Shuffle Enabled
 
Resource Metrics
Current vCPUs 1
Total vCPU time 12.38 vCPU hr
Current memory 3.75 GB
Total memory time 46.425 GB hr
Current HDD PD 25 GB
Total HDD PD time 309.5 GB hr
Current SSD PD 0 B
Total SSD PD time 0 GB hr
Total Shuffle data processed 312.22 GB
Billable Shuffle data processed 93.61 GB
 
Step info
 
Elements added 4,798,230
Estimated size 11.37 GB

[Attached screenshot: Screen Shot 2023-05-04 at 17.42.02.png]

--
Best regards
David Regalado
Web | Linkedin | Twitter

3 REPLIES

To diagnose the specific cause of the failure in your pipeline, it would help to look at the error messages or logs that Dataflow provides. Can you post the error messages that you see in the logs?

@ms4446 My bad. Here is the error: "The worker VM had to shut down one or more processes due to lack of memory."


--
Best regards
David Regalado
Web | Linkedin | Twitter

Thanks for confirming the error message. An "insufficient memory" error can happen when the code running in your Dataflow job is blocked or takes a long time to perform operations, which can lead to memory pressure.

To resolve this issue, you can try the following steps:

  1. Increase the worker machine type: If your Dataflow job is running on VMs with limited memory, consider upgrading the worker machine type to VMs with more memory. This gives your job additional memory headroom so it can run without hitting memory limits (see the first sketch after this list).

  2. Optimize your code: Review your code for bottlenecks or memory-intensive operations. Look for places where you can reduce memory usage or improve performance, for example by using more efficient algorithms, streamlining data processing steps, or using transforms that handle large datasets more effectively (see the second sketch after this list).

  3. Adjust the parallelism: If each worker is processing many work items concurrently, lowering the per-worker parallelism (for example with the number_of_worker_harness_threads pipeline option) can reduce memory usage per worker. You can also tune the num_workers and max_num_workers parameters to control how many workers the job uses. Be cautious not to set these too low, as that may hurt overall job performance.

  4. Increase the worker disk size: If your job is processing large datasets or generating intermediate data that consumes a significant amount of disk space, increasing the worker disk size can help alleviate memory pressure. Larger disk sizes provide more temporary storage space for intermediate data, reducing the need for excessive memory usage.

  5. Enable autoscaling: Consider enabling autoscaling for your Dataflow job. Autoscaling automatically adjusts the number of workers based on the current workload, which can help manage memory usage more effectively. Dataflow can dynamically scale up or down the number of workers based on the input data size and processing requirements.

  6. Split your job into smaller steps: If your Dataflow job has a complex pipeline with multiple stages or transformations, consider breaking it down into smaller, more manageable steps. This approach can reduce the memory requirements for each individual step and make it easier to isolate and troubleshoot potential memory issues.
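
For points 1, 3, 4, and 5, the knobs can be set directly as pipeline options when launching the job. This is only a sketch, and every value below (project, bucket, machine type, worker count, disk size) is a placeholder you would adjust for your own job:

    from apache_beam.options.pipeline_options import PipelineOptions

    # Sketch only: all values are placeholders, adjust them for your project and job.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                      # placeholder project ID
        region="us-central1",
        temp_location="gs://my-bucket/temp",       # placeholder bucket
        machine_type="n1-highmem-4",               # 1) worker VMs with more memory
        max_num_workers=20,                        # 3) upper bound on parallelism
        disk_size_gb=100,                          # 4) larger worker persistent disks
        autoscaling_algorithm="THROUGHPUT_BASED",  # 5) let Dataflow scale the worker pool
    )
    # pipeline = beam.Pipeline(options=options)

For point 2, if the GroupByKey is only feeding an aggregation, replacing it with a combiner can be a big memory win, because combiners pre-aggregate values on the workers before the shuffle instead of materializing every value for a key at once. Again just a sketch, assuming a per-title sum and placeholder data:

    import apache_beam as beam

    # Sketch: CombinePerKey pre-aggregates on each worker, so a hot key no longer
    # requires holding all of its values in memory at the same time.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "create" >> beam.Create([("a", 1), ("a", 2), ("b", 3)])  # placeholder data
            | "sum per main_title" >> beam.CombinePerKey(sum)
            | "print" >> beam.Map(print)
        )

Whether that substitution applies depends on what the grouped values are used for downstream.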