Solved: Re: Hitting OOM issues when increasing compute res...

jfishbein · 09-20-2023 05:53 AM

Hi,

I have a job that runs fairly consistently using Docker container runnables on c2-standard-16 instances, where I specify both the instance type and the compute resources. When we set the computeResource cpuMilli to the equivalent of 4 vCPUs and memoryMib to the equivalent of 16 GB, the job completes successfully most of the time. When we increase the computeResource to 8 vCPUs and 32 GB of RAM, the job becomes extremely slow: runtime goes from roughly 2 min to upwards of 30 min, and some tasks fail with an unresponsive instance or OOM. Looking at our quotas, I can see that we're nowhere near any limits. When I look at the Batch agent logs for the runs with increased CPU/RAM, I see many messages like the following:

rpc error: code = Unavailable desc = 502:Bad Gateway. Retrying in 3.09654082s

When SSHing onto instances in this case, we can see that some instances have a large number of Docker containers running on them (20+) while others have 0. Running docker stats on the VM instances with 20+ containers, we see that the combined CPU usage across containers is well over 100% and that RAM usage is dangerously close to its limit as well. Given the increased CPU/RAM requirements and the instance type, I would have expected at most 2 tasks per VM instance.
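The arithmetic behind that expectation can be sketched as below. This is only a rough illustration: it assumes tasks are packed purely by dividing the VM's resources by the requested cpuMilli and memoryMib, which may not match what the Batch scheduler actually does. The function name `expected_tasks_per_vm` is made up for this example; the c2-standard-16 shape (16 vCPUs, 64 GB of memory) is the published machine type spec.

```python
def expected_tasks_per_vm(vm_cpu_milli, vm_memory_mib, task_cpu_milli, task_memory_mib):
    """Upper bound on tasks per VM, limited by whichever resource runs out first.

    Assumption: simple floor division, with no memory reserved for the OS or agent.
    """
    by_cpu = vm_cpu_milli // task_cpu_milli
    by_memory = vm_memory_mib // task_memory_mib
    return min(by_cpu, by_memory)


# c2-standard-16: 16 vCPUs and 64 GB of memory, expressed in Batch's units.
VM_CPU_MILLI = 16 * 1000
VM_MEMORY_MIB = 64 * 1024

# Original request: 4 vCPUs / 16 GB per task.
small = expected_tasks_per_vm(VM_CPU_MILLI, VM_MEMORY_MIB, 4 * 1000, 16 * 1024)
# Increased request: 8 vCPUs / 32 GB per task.
large = expected_tasks_per_vm(VM_CPU_MILLI, VM_MEMORY_MIB, 8 * 1000, 32 * 1024)

print(small, large)  # prints "4 2"
```

Under that assumption, doubling the per-task request should halve the tasks per VM (4 down to 2), which is the opposite of the 20+ containers observed.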

Additionally, we've tried setting taskCountPerNode to 2, on top of the compute resources and instance type. Even so, when I SSH onto the VM instances I still see far more than 2 containers actively executing per instance.
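For reference, the combination of settings described above looks roughly like the sketch below, written as a Python dict that serializes to the job JSON. The field layout reflects my understanding of the Batch job schema (taskGroups, taskSpec.computeResource, taskCountPerNode, allocationPolicy); the helper name `build_job_config`, the image URI, and the task count are placeholders, not values from our actual job.

```python
import json


def build_job_config(vcpus, memory_gib, tasks_per_node, task_count, image_uri, machine_type):
    """Build a Batch-style job config dict (hypothetical helper for illustration)."""
    return {
        "taskGroups": [
            {
                "taskCount": task_count,
                # Cap on how many tasks may run on a single VM at once.
                "taskCountPerNode": tasks_per_node,
                "taskSpec": {
                    "computeResource": {
                        "cpuMilli": vcpus * 1000,        # 1 vCPU = 1000 milli-CPU
                        "memoryMib": memory_gib * 1024,  # 1 GiB = 1024 MiB
                    },
                    "runnables": [{"container": {"imageUri": image_uri}}],
                },
            }
        ],
        "allocationPolicy": {
            "instances": [{"policy": {"machineType": machine_type}}],
        },
    }


job = build_job_config(
    vcpus=8,
    memory_gib=32,
    tasks_per_node=2,
    task_count=10,  # placeholder task count
    image_uri="gcr.io/example-project/example-image:latest",  # placeholder image
    machine_type="c2-standard-16",
)
print(json.dumps(job, indent=2))
```

With these values the per-task request is 8000 cpuMilli and 32768 memoryMib, so both the resource request and taskCountPerNode point at 2 tasks per c2-standard-16 VM.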

This leads to a few questions:

1) What is the relationship between the instance machine type, compute resources, and task count per node?

2) Is there a different way we need to set the resource/instance parameters to control how many tasks are sent to each instance? I'm a bit confused why increasing the task resources seems to cause more tasks to be allocated per VM instance rather than fewer.

3) Is anyone else running into this issue? I'm pretty confident that we were able to change the compute resource parameters in the past without issue. More recently, however, doing so consistently results in significantly slower job execution and sometimes task failures with OOM issues.

bolianyin

@jfishbein Thanks for reporting the issue. We believe this is related to a recent bug, and the fix is being rolled out. If everything goes as planned, the issue should be fixed within one or two days. Sorry for the inconvenience. Please try again later.

Hitting OOM issues when increasing compute resources on Batch