Hi all!
I have created a Google Batch job that uses our custom Docker image and runs a command on a binary that is first uploaded to Cloud Storage.
Sometimes the memory that I requested for the job is not enough and I get:
ERROR 2023-12-16T08:54:45.616547576Z File "/workdir/.pyenv/versions/3.10.12/lib/python3.10/multiprocessing/synchronize.py", line 57, in __init__
ERROR 2023-12-16T08:54:45.616836128Z sl = self._semlock = _multiprocessing.SemLock(
ERROR 2023-12-16T08:54:45.616845582Z OSError: [Errno 28] No space left on device
However, the job keeps running, even though the application appears to be in a broken state; from this point on, the logs only contain errors like:
ERROR 2023-12-16T08:54:53.326941828Z [fc68fbcdea7d:2911082] *** End of error message ***
ERROR 2023-12-16T08:54:53.408045636Z [fc68fbcdea7d:2911080] *** Process received signal ***
ERROR 2023-12-16T08:54:53.408134705Z [fc68fbcdea7d:2911080] Signal: Bus error (7)
ERROR 2023-12-16T08:54:53.408148494Z [fc68fbcdea7d:2911080] Signal code: Non-existant physical address (2)
ERROR 2023-12-16T08:54:53.408162601Z [fc68fbcdea7d:2911080] Failing at address: 0x7c67e97e1000
ERROR 2023-12-16T08:54:53.408353586Z [fc68fbcdea7d:2911080] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7c67e9d36520]
ERROR 2023-12-16T08:54:53.408448443Z [fc68fbcdea7d:2911080] [ 1] /root/.pex/venvs/ad20851edb19e44225445ad8dc18b74d77b6905a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.10/site-packages/jaxlib/xla_extension.so(+0x5b1790c)[0x7c6787c7290c]
ERROR 2023-12-16T08:54:53.408459766Z [fc68fbcdea7d:2911080] [ 2] /root/.pex/venvs/ad20851edb19e44225445ad8dc18b74d77b6905a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.10/site-packages/jaxlib/xla_extension.so(+0x33e293d)[0x7c678553d93d]
ERROR 2023-12-16T08:54:53.408467981Z [fc68fbcdea7d:2911080] [ 3] /root/.pex/venvs/ad20851edb19e44225445ad8dc18b74d77b6905a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.10/site-packages/jaxlib/xla_extension.so(+0x8e54cf)[0x7c6782a404cf]
ERROR 2023-12-16T08:54:53.408478080Z [fc68fbcdea7d:2911080] [ 4] /root/.pex/venvs/ad20851edb19e44225445ad8dc18b74d77b6905a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.10/site-packages/jaxlib/xla_extension.so(+0x8b7e31)[0x7c6782a12e31]
ERROR 2023-12-16T08:54:53.408489445Z [fc68fbcdea7d:2911080] [ 5] /workdir/.pyenv/versions/3.10.12/lib/libpython3.10.so.1.0(+0x110553)[0x7c67ea034553]
ERROR 2023-12-16T08:54:53.408499630Z [fc68fbcdea7d:2911080] [ 6] /workdir/.pyenv/versions/3.10.12/lib/libpython3.10.so.1.0(_PyObject_MakeTpCall+0x8c)[0x7c67e9feae8c]
ERROR 2023-12-16T08:54:53.408522510Z [fc68fbcdea7d:2911080] [ 7] /workdir/.pyenv/versions/3.10.12/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x8024)[0x7c67e9f96f74]
ERROR 2023-12-16T08:54:53.408677819Z [fc68fbcdea7d:2911080] [ 8] /workdir/.pyenv/versions/3.10.12/lib/libpython3.10.so.1.0(+0x1b9c34)[0x7c67ea0ddc34]
ERROR 2023-12-16T08:54:53.408701982Z [fc68fbcdea7d:2911080] [ 9] /workdir/.pyenv/versions/3.10.12/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x8d77)[0x7c67e9f97cc7]
...
I would like jobs to always be terminated in this case, but I could not find any pointers as to why this does not happen automatically.
Thanks!
At the moment, Batch does not have a mechanism to automatically fail a job when it runs out of memory. For the time being, we recommend manually deleting the job to terminate it. An automated alternative is an event-based trigger that deletes the job based on the Cloud Logging entry (reference).
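If you want a starting point for that event-based trigger, below is a minimal sketch, assuming a Cloud Logging sink routes the matching error entries to a Pub/Sub topic that invokes a Cloud Function. The sink filter, topic, function name, and the log-entry label names used to reconstruct the job name are assumptions (inspect one of your own Batch log entries to confirm where the project, region, and job ID actually live); the only Batch API call involved is `BatchServiceClient.delete_job`.

```python
# Hypothetical Pub/Sub-triggered Cloud Function that deletes a Batch job
# when a log sink routes a matching error entry to the topic.
import base64
import json

import functions_framework
from google.cloud import batch_v1

batch_client = batch_v1.BatchServiceClient()


@functions_framework.cloud_event
def delete_failed_batch_job(cloud_event):
    # The log sink wraps the LogEntry as JSON in the Pub/Sub message payload.
    payload = base64.b64decode(cloud_event.data["message"]["data"]).decode("utf-8")
    entry = json.loads(payload)

    # NOTE: the label names below are assumptions -- check a real Batch log
    # entry in your project to confirm which labels carry these values.
    resource_labels = entry.get("resource", {}).get("labels", {})
    project = resource_labels.get("project_id")
    location = resource_labels.get("location")
    job_id = entry.get("labels", {}).get("job_id")

    if not (project and location and job_id):
        print(f"Could not determine job from log entry: {entry}")
        return

    # Deleting the job also terminates any still-running tasks and their VMs.
    name = f"projects/{project}/locations/{location}/jobs/{job_id}"
    batch_client.delete_job(name=name)
    print(f"Requested deletion of {name}")
```

The sink itself could be created with something like `gcloud logging sinks create batch-oom-sink pubsub.googleapis.com/projects/PROJECT_ID/topics/TOPIC_ID --log-filter='FILTER'`, where the names are placeholders and FILTER matches the specific error text (for example the "No space left on device" message) scoped to your Batch job logs.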
Thank you for the answer! We would at least like to be notified automatically when this happens. Is the Cloud Logging integration the only way to achieve that?