Hi everyone,
For some reason, my custom training jobs with a custom container keep failing on Vertex AI, but the local run works fine. (I verified this with a local run as indicated in the docs here. I also built the image and ran it manually, and it works fine.)
Error log when running the custom job:
<code>
{
"insertId": "2s7rqvfjzoq4v",
"jsonPayload": {
"attrs": {
"tag": "workerpool0-0"
},
"message": "/opt/conda/bin/python: Error while finding module specification for 'trainer.train' (ModuleNotFoundError: No module named 'trainer')\n",
"levelname": "ERROR"
},
</code>
Can you share the code or gcloud command you're using to submit the job? With a custom container, you don't need to define a trainer.
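For reference, here is a minimal sketch of submitting a custom container job with the Vertex AI Python SDK without any trainer package. The helper name `build_worker_pool_spec`, the machine type, the display name, and all argument values are placeholders I made up for illustration; the sketch assumes the image's own ENTRYPOINT starts the training unless `command` is passed.

```python
from typing import Any, Dict, List, Optional


def build_worker_pool_spec(
    image_uri: str,
    machine_type: str = "n1-standard-4",  # placeholder machine type
    command: Optional[List[str]] = None,
    args: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build one worker pool spec for a custom container (hypothetical helper)."""
    container_spec: Dict[str, Any] = {"image_uri": image_uri}
    # Only override the image's ENTRYPOINT/CMD when explicitly asked to.
    if command:
        container_spec["command"] = command
    if args:
        container_spec["args"] = args
    return {
        "machine_spec": {"machine_type": machine_type},
        "replica_count": 1,
        "container_spec": container_spec,
    }


def submit_job(project: str, location: str, staging_bucket: str, image_uri: str) -> None:
    """Submit the job; needs google-cloud-aiplatform installed and GCP credentials."""
    # Imported here so the spec helper above can be used without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location, staging_bucket=staging_bucket)
    job = aiplatform.CustomJob(
        display_name="custom-container-job",  # placeholder name
        worker_pool_specs=[build_worker_pool_spec(image_uri)],
    )
    job.run()
```

If the container needs a different start command, something like `build_worker_pool_spec(image_uri, command=["python", "train.py"])` would override the ENTRYPOINT, where `train.py` stands for whatever script the image actually contains.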
Hi @sascha_heyer ,
I really appreciate your kind help.
But I was a bit hasty in posting this thread. I started the custom job from a workbench with this code:
gcloud.aiplatform.CustomContainerJob(...)
The error logs I received were for the Training Pipeline in Vertex AI (not sure why a Training Pipeline was created when I only initialized the Custom Job).
A while later the logs updated and the training job executed correctly :))
So I think I'm fine now. Thanks so much for your kind help.
P.S. I created the trainer module because I was following the docs for custom containers and their recommended code structure. Now I've learned that I could have just used a simple training script instead. Thanks again.
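In case it helps anyone who hits the same `Error while finding module specification` message: it shows up when `python -m trainer.train` is run and the `trainer` package cannot be found on the import path. Below is a small sketch of a check that could be run inside the container, from the same working directory the job uses. The function name `module_is_resolvable` is my own and not part of any Vertex AI tooling.

```python
import importlib.util


def module_is_resolvable(name: str) -> bool:
    """Return True if `python -m <name>` would be able to locate the module."""
    try:
        # For dotted names, find_spec imports the parent package first and
        # raises ModuleNotFoundError when that parent is missing.
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        return False


if __name__ == "__main__":
    # 'trainer.train' is the module name from the error log above.
    print("trainer.train resolvable:", module_is_resolvable("trainer.train"))
```

If this prints `False` inside the image, the package is probably not copied to the working directory or is not on `PYTHONPATH`.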