
DATAFLOW: Job failed with the Shuffle feature enabled

Hello everyone!

I have this DAG in Dataflow. The step where it fails is "group main_title", which is just a beam.GroupByKey().

As I understand it, when doing GroupByKey operations the runner enables the Shuffle service, which means the data is moved out of the worker VMs and sent to the Dataflow backend to perform the shuffle. Why is it failing?
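
For context, the failing step is essentially the sketch below. The input records and the key field are placeholders here, not my real source, but the shape of the pipeline is the same:

    import apache_beam as beam

    # Minimal sketch of the failing stage: key each record by main_title, then group.
    # The Create() input is placeholder data, not the real source.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "create" >> beam.Create([
                {"main_title": "a", "value": 1},
                {"main_title": "a", "value": 2},
                {"main_title": "b", "value": 3},
            ])
            | "key by main_title" >> beam.Map(lambda row: (row["main_title"], row))
            | "group main_title" >> beam.GroupByKey()
            | "print" >> beam.Map(print)
        )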

Job info

Job type Batch
Job status Failed
SDK version Apache Beam Python 3.8 SDK 2.37.0
Job region us-central1
Worker location us-central1
Current workers 0
Latest worker status Worker pool stopped.
Start time May 4, 2023 at 12:40:06 PM GMT-5
Elapsed time 1 hr 7 min
Encryption typeGoogle-managed key
Dataflow Prime Disabled
Runner v2 Enabled
Dataflow Shuffle Enabled
 
Resource Metrics
Current vCPUs 1
Total vCPU time 12.38 vCPU hr
Current memory 3.75 GB
Total memory time 46.425 GB hr
Current HDD PD 25 GB
Total HDD PD time 309.5 GB hr
Current SSD PD 0 B
Total SSD PD time 0 GB hr
Total Shuffle data processed 312.22 GB
Billable Shuffle data processed 93.61 GB
 
Step info
 
Elements added 4,798,230
Estimated size 11.37 GB

[Attached screenshot: Screen Shot 2023-05-04 at 17.42.02.png]

--
Best regards
David Regalado
Web | Linkedin | Twitter

3 REPLIES

To diagnose the specific cause of the failure in your pipeline, it would help to look at the error messages or logs that Dataflow provides. Can you post the error messages that you see in the logs?

@ms4446 My bad. Here is the error: "The worker VM had to shut down one or more processes due to lack of memory."


--
Best regards
David Regalado
Web | Linkedin | Twitter

Thanks for confirming the error message. An "insufficient memory" error can happen when the code running in your Dataflow job is blocked or takes a long time to perform operations, which can lead to memory pressure.

To resolve this issue, you can try the following steps:

  1. Increase the worker machine type: If your Dataflow job is running on VMs with limited memory, consider upgrading the worker machine type to VMs with more memory. This gives your job additional memory headroom so it can run without hitting memory limits (see the first sketch after this list).

  2. Optimize your code: Review your code for bottlenecks or memory-intensive operations. Look for places where you can reduce memory usage or improve performance, for example by using more efficient algorithms, streamlining data processing steps, or using transforms that handle large datasets more effectively (see the second sketch after this list).

  3. Adjust the parallelism: If each worker is processing many work items concurrently, lowering the per-worker parallelism (for example with the number_of_worker_harness_threads pipeline option) can reduce memory usage per worker. You can also tune the num_workers and max_num_workers parameters to control how many workers the job uses. Be cautious not to set these too low, as that may hurt overall job performance.

  4. Increase the worker disk size: If your job is processing large datasets or generating intermediate data that consumes a significant amount of disk space, increasing the worker disk size can help alleviate memory pressure. Larger disk sizes provide more temporary storage space for intermediate data, reducing the need for excessive memory usage.

  5. Enable autoscaling: Consider enabling autoscaling for your Dataflow job. Autoscaling automatically adjusts the number of workers based on the current workload, which can help manage memory usage more effectively. Dataflow can dynamically scale up or down the number of workers based on the input data size and processing requirements.

  6. Split your job into smaller steps: If your Dataflow job has a complex pipeline with multiple stages or transformations, consider breaking it down into smaller, more manageable steps. This approach can reduce the memory requirements for each individual step and make it easier to isolate and troubleshoot potential memory issues.
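
For points 1, 3, 4, and 5, the knobs can be set directly as pipeline options when launching the job. This is only a sketch, and every value below (project, bucket, machine type, worker count, disk size) is a placeholder you would adjust for your own job:

    from apache_beam.options.pipeline_options import PipelineOptions

    # Sketch only: all values are placeholders, adjust them for your project and job.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                      # placeholder project ID
        region="us-central1",
        temp_location="gs://my-bucket/temp",       # placeholder bucket
        machine_type="n1-highmem-4",               # 1) worker VMs with more memory
        max_num_workers=20,                        # 3) upper bound on parallelism
        disk_size_gb=100,                          # 4) larger worker persistent disks
        autoscaling_algorithm="THROUGHPUT_BASED",  # 5) let Dataflow scale the worker pool
    )
    # pipeline = beam.Pipeline(options=options)

For point 2, if the GroupByKey is only feeding an aggregation, replacing it with a combiner can be a big memory win, because combiners pre-aggregate values on the workers before the shuffle instead of materializing every value for a key at once. Again just a sketch, assuming a per-title sum and placeholder data:

    import apache_beam as beam

    # Sketch: CombinePerKey pre-aggregates on each worker, so a hot key no longer
    # requires holding all of its values in memory at the same time.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "create" >> beam.Create([("a", 1), ("a", 2), ("b", 3)])  # placeholder data
            | "sum per main_title" >> beam.CombinePerKey(sum)
            | "print" >> beam.Map(print)
        )

Whether that substitution applies depends on what the grouped values are used for downstream.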