Vertex AI Custom Training Job Container not finding my module: Error while finding module for '...'

Hello,

I have a PyTorch training job that I package as a Python source distribution (a .tar.gz sdist). I upload the sdist to a GCS bucket and run it in a container using the gcloud ai custom-jobs create command.

Up until a couple of weeks ago this worked fine, but in recent days my jobs have consistently failed with messages like these in their logs:

Running command: python3 -m MyPackage.MyModule --job-dir=gs://my-bucket/my-job/model --model-name=my-model

/opt/conda/bin/python3: Error while finding module specification for 'MyPackage.MyModule' (ModuleNotFoundError: No module named 'MyPackage.MyModule')

MyPackage.MyModule is the module that contains my training code.

As mentioned above, the same procedure worked until recently. Nothing in it has changed, and I can clearly see that MyModule.py is located under MyPackage in my .tar.gz file.
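
One way to double-check the archive layout beyond eyeballing it is sketched below. This is only a minimal sketch: the helper name check_sdist_layout and the example file name are hypothetical, and it assumes the usual sdist layout in which files sit under a single top-level directory. It reports whether the archive contains MyPackage/MyModule.py, a MyPackage/__init__.py, and a setup.py or pyproject.toml.

```python
import tarfile
from pathlib import PurePosixPath


def check_sdist_layout(tar_path, package="MyPackage", module="MyModule"):
    """Report whether the archive holds the files needed to import package.module.

    Hypothetical helper for illustration; not part of any Google Cloud tooling.
    """
    with tarfile.open(tar_path, "r:gz") as tar:
        names = [PurePosixPath(n) for n in tar.getnames()]

    def has(*tail):
        # True if some archive member path ends with the given components.
        return any(p.parts[-len(tail):] == tail for p in names)

    return {
        "package_init": has(package, "__init__.py"),
        "module_file": has(package, module + ".py"),
        "setup_script": has("setup.py") or has("pyproject.toml"),
    }


# Example (the archive name is an assumption):
# print(check_sdist_layout("MyPackage-0.1.tar.gz"))
```

A False value for package_init would be worth a closer look, since setuptools' find_packages() only picks up directories that contain an __init__.py.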

The container image I am using is us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-9:latest, and from what I can tell it has not changed since I last used it successfully.

Why is the Vertex AI container not finding my training module? How can I further debug and fix this?
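
As a possible starting point for debugging, the following sketch prints which interpreter is running, what is on sys.path, and where Python would load the package and module from. It is an assumption-laden sketch, not an official Vertex AI diagnostic: the helper name locate is hypothetical, and how the script gets run inside the training container is left open.

```python
import importlib.util
import sys


def locate(name):
    """Return where Python would load `name` from, or a short reason it cannot."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, AttributeError, ValueError) as exc:
        # Typically raised when a parent package is missing or is not a package.
        return f"error: {type(exc).__name__}: {exc}"
    if spec is None:
        return "not found on sys.path"
    # A spec without an origin usually indicates a namespace package.
    return spec.origin or "no single origin (possibly a namespace package)"


if __name__ == "__main__":
    print("python:", sys.executable, sys.version.split()[0])
    for entry in sys.path:
        print("sys.path entry:", entry)
    for name in ("MyPackage", "MyPackage.MyModule"):
        print(name, "->", locate(name))
```

If MyPackage itself cannot be located, the package was probably never installed into the interpreter that runs the job. If MyPackage resolves but MyPackage.MyModule does not, the installed copy is likely missing that file.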
