Hi all!
I have created a Google Batch job with our custom Docker image and a command that I run on a binary, which is uploaded to Cloud Storage first.
Sometimes the memory that I requested for the job is not enough, and I get:
ERROR 2023-12-16T08:54:45.616547576Z File "/workdir/.pyenv/versions/3.10.12/lib/python3.10/multiprocessing/synchronize.py", line 57, in __init__
ERROR 2023-12-16T08:54:45.616836128Z sl = self._semlock = _multiprocessing.SemLock(
ERROR 2023-12-16T08:54:45.616845582Z OSError: [Errno 28] No space left on device
However, the job keeps running, even though the application appears to be in a broken state; from this point on, the logs contain only errors like these:
ERROR 2023-12-16T08:54:53.326941828Z [fc68fbcdea7d:2911082] *** End of error message ***
ERROR 2023-12-16T08:54:53.408045636Z [fc68fbcdea7d:2911080] *** Process received signal ***
ERROR 2023-12-16T08:54:53.408134705Z [fc68fbcdea7d:2911080] Signal: Bus error (7)
ERROR 2023-12-16T08:54:53.408148494Z [fc68fbcdea7d:2911080] Signal code: Non-existant physical address (2)
ERROR 2023-12-16T08:54:53.408162601Z [fc68fbcdea7d:2911080] Failing at address: 0x7c67e97e1000
ERROR 2023-12-16T08:54:53.408353586Z [fc68fbcdea7d:2911080] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7c67e9d36520]
ERROR 2023-12-16T08:54:53.408448443Z [fc68fbcdea7d:2911080] [ 1] /root/.pex/venvs/ad20851edb19e44225445ad8dc18b74d77b6905a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.10/site-packages/jaxlib/xla_extension.so(+0x5b1790c)[0x7c6787c7290c]
ERROR 2023-12-16T08:54:53.408459766Z [fc68fbcdea7d:2911080] [ 2] /root/.pex/venvs/ad20851edb19e44225445ad8dc18b74d77b6905a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.10/site-packages/jaxlib/xla_extension.so(+0x33e293d)[0x7c678553d93d]
ERROR 2023-12-16T08:54:53.408467981Z [fc68fbcdea7d:2911080] [ 3] /root/.pex/venvs/ad20851edb19e44225445ad8dc18b74d77b6905a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.10/site-packages/jaxlib/xla_extension.so(+0x8e54cf)[0x7c6782a404cf]
ERROR 2023-12-16T08:54:53.408478080Z [fc68fbcdea7d:2911080] [ 4] /root/.pex/venvs/ad20851edb19e44225445ad8dc18b74d77b6905a/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/lib/python3.10/site-packages/jaxlib/xla_extension.so(+0x8b7e31)[0x7c6782a12e31]
ERROR 2023-12-16T08:54:53.408489445Z [fc68fbcdea7d:2911080] [ 5] /workdir/.pyenv/versions/3.10.12/lib/libpython3.10.so.1.0(+0x110553)[0x7c67ea034553]
ERROR 2023-12-16T08:54:53.408499630Z [fc68fbcdea7d:2911080] [ 6] /workdir/.pyenv/versions/3.10.12/lib/libpython3.10.so.1.0(_PyObject_MakeTpCall+0x8c)[0x7c67e9feae8c]
ERROR 2023-12-16T08:54:53.408522510Z [fc68fbcdea7d:2911080] [ 7] /workdir/.pyenv/versions/3.10.12/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x8024)[0x7c67e9f96f74]
ERROR 2023-12-16T08:54:53.408677819Z [fc68fbcdea7d:2911080] [ 8] /workdir/.pyenv/versions/3.10.12/lib/libpython3.10.so.1.0(+0x1b9c34)[0x7c67ea0ddc34]
ERROR 2023-12-16T08:54:53.408701982Z [fc68fbcdea7d:2911080] [ 9] /workdir/.pyenv/versions/3.10.12/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x8d77)[0x7c67e9f97cc7]
...
I would like the job to always be terminated in this case, but I could not find any pointers as to why this does not happen automatically.
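To make the behavior I am after concrete, here is a minimal sketch of a wrapper I could put in front of the command. It is not something Batch provides: `run_with_watchdog` and `FATAL_PATTERNS` are hypothetical names of my own, the patterns are simply copied from the log lines above, and it assumes a Linux container and that Batch treats a non-zero exit code of the command as a task failure.

```python
import os
import signal
import subprocess
import sys

# Hypothetical patterns, copied from the log lines above; adjust as needed.
FATAL_PATTERNS = ("No space left on device", "Bus error")


def run_with_watchdog(cmd, patterns=FATAL_PATTERNS):
    """Run cmd and kill its whole process group on the first fatal log line.

    Returns the child's exit code, or 1 if the watchdog had to kill it.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so one loop sees every line
        text=True,
        start_new_session=True,  # own process group, so worker processes die too
    )
    killed = False
    for line in proc.stdout:
        sys.stdout.write(line)  # keep forwarding the logs unchanged
        if not killed and any(p in line for p in patterns):
            killed = True
            try:
                # With start_new_session=True the child's pid is also its
                # process group id, so this kills the child and its workers.
                os.killpg(proc.pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the process group is already gone
    proc.wait()
    return 1 if killed else proc.returncode
```

The job command would then call something like `sys.exit(run_with_watchdog(sys.argv[1:]))` instead of starting the binary directly. Matching on log text is fragile, so I would much prefer a built-in way to get the same result.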
Thanks!