Hello,
I have a PyTorch training job that I package as a Python source distribution (sdist, a .tar.gz file). I upload the sdist to a GCS bucket and run it in a container using the gcloud ai custom-jobs create CLI command.
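For reference, the sdist is built with a plain setuptools setup.py along these lines (simplified sketch; the distribution name, version, and dependencies shown here are placeholders, not the real values):

# setup.py -- simplified sketch of how the sdist is built (placeholder name/version)
from setuptools import find_packages, setup

setup(
    name="my-package",             # placeholder distribution name
    version="0.1",
    packages=find_packages(),      # picks up MyPackage/ via its __init__.py
    install_requires=[],           # the real file pins PyTorch and other dependencies
)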
Up until a couple of weeks ago this worked fine, but in recent days my jobs have consistently failed with messages like these in their logs:
Running command: python3 -m MyPackage.MyModule --job-dir=gs://my-bucket/my-job/model --model-name=my-model
/opt/conda/bin/python3: Error while finding module specification for 'MyPackage.MyModule' (ModuleNotFoundError: No module named 'MyPackage.MyModule')
MyPackage.MyModule is, of course, the module that contains my training code.
As I've mentioned above, the same procedure worked until recently. There have been no changes to it, and I can clearly see that MyModule.py is located under MyPackage in my .tar.gz file.
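For example, a quick local check along these lines (the archive path is a placeholder for the file I actually upload to GCS) does list the module:

# check_sdist.py -- confirm the training module really is inside the sdist
import tarfile

# Placeholder path; substitute the archive that actually gets uploaded to GCS.
with tarfile.open("dist/my-package-0.1.tar.gz", "r:gz") as archive:
    for name in archive.getnames():
        if name.endswith(("MyPackage/__init__.py", "MyPackage/MyModule.py")):
            print("found:", name)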
The container image I am using is us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-9:latest, and from what I can tell it has not changed since I last used it successfully.
Why is the Vertex AI container not finding my training module? How can I further debug and fix this?
Check this article[1] for common causes of ModuleNotFoundError and how to fix them.
[1] https://towardsdatascience.com/how-to-fix-modulenotfounderror-and-importerror-248ce5b69b1c
Hi Jose,
Thank you for trying to help. Alas, I've already followed all the suggestions in the linked article, to no avail. Something funky is going on between the Vertex AI Python code that looks for my module and the way my .tar.gz is structured. At this point, without being able to see the Vertex AI code, I don't know how to debug this further; the closest I can get is reproducing the module lookup myself in a clean environment, as sketched below.
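This is only a sketch of that check: it assumes the sdist has been pip-installed into a fresh virtualenv, ideally inside the same pytorch-gpu.1-9 image.

# repro_lookup.py -- run after "pip install <path-to-sdist>" in a clean environment
# to mimic how "python3 -m" resolves the module.
import importlib.util

spec = importlib.util.find_spec("MyPackage.MyModule")
print("module spec:", spec)
# A ModuleNotFoundError (or None) here would reproduce exactly what the
# Vertex AI container reports.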