Hi everyone,
For some reason, my custom training jobs with a custom container keep failing on Vertex AI, although the same code works fine locally. I verified this with a local run, as indicated in the docs here, and I also built the image and ran it manually without any problems.
Error log when running custom job:
<code>
{
  "insertId": "2s7rqvfjzoq4v",
  "jsonPayload": {
    "attrs": {
      "tag": "workerpool0-0"
    },
    "message": "/opt/conda/bin/python: Error while finding module specification for 'trainer.train' (ModuleNotFoundError: No module named 'trainer')\n",
    "levelname": "ERROR"
  },
</code>
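To narrow down an error like this, a minimal diagnostic sketch along the following lines could be run inside the container, both locally and as the job's entrypoint, so the two outputs can be compared. It assumes the training code lives in a package named `trainer`, as the error message suggests; `describe_module` is a hypothetical helper name, not something from the Vertex AI SDK.

```python
import importlib.util
import os
import sys


def describe_module(name: str) -> dict:
    """Report whether `name` can be located by the current interpreter."""
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. 'trainer') cannot be found.
        spec = None
    return {
        "module": name,
        "found": spec is not None,
        "origin": getattr(spec, "origin", None),
        "cwd": os.getcwd(),
        "sys_path": list(sys.path),
    }


if __name__ == "__main__":
    # 'trainer.train' is taken from the error message above.
    info = describe_module("trainer.train")
    for key, value in info.items():
        print(f"{key}: {value}")
```

If `found` is true locally but false in the job, differences in `cwd` or `sys_path` between the two runs would be the first thing to look at, since `python -m trainer.train` resolves the package relative to those locations.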