
Monitoring training progress and reviewing results in Colab Enterprise

Hi, I am trying to train an EfficientViT model from the GitHub repository at https://github.com/mit-han-lab/efficientvit.git. I cloned the repository, pip installed all the necessary modules, and imported them. Then I ran this command to start training the model, as described in the project's TRAINING.md:

          ! torchpack dist-run -np 8 \
          python efficientvit/train_cls_model.py efficientvit/configs/cls/imagenet/b1.yaml \
          --data_provider.image_size "[128,160,192,224,256,288]" \
          --run_config.eval_image_size "[288]" \
          --path ./exp/cls/imagenet/b1_r288/

After submitting this command, I saw "Waiting for the current execution to complete" and no other output for the next 4 hours. Then I went to bed, and the next morning the runtime appeared to have terminated.

Typically, as training progresses, status information is printed out, but I didn't see anything printed. I also don't know what the result of the training run was. What can I do to rectify this? Any help is appreciated.
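
For what it is worth, one change I am considering for the next attempt is to tee all output to a log file (ideally somewhere persistent, such as a mounted Cloud Storage bucket) so that there is at least something to inspect afterwards. This is an untested sketch and the log filename is arbitrary:

    ! torchpack dist-run -np 8 \
    python efficientvit/train_cls_model.py efficientvit/configs/cls/imagenet/b1.yaml \
    --data_provider.image_size "[128,160,192,224,256,288]" \
    --run_config.eval_image_size "[288]" \
    --path ./exp/cls/imagenet/b1_r288/ 2>&1 | tee train_b1_r288.log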

2 REPLIES

I believe the error only displays during runtime. You can try going to Cloud Logging, querying the resource, and narrowing the date range down to when it happened.
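
If it helps, something along these lines from Cloud Shell pulls the log entries for the window in which the runtime stopped. The timestamps and project ID are placeholders, and you can tighten the filter further once Logs Explorer shows you which resource type your runtime logs under:

    gcloud logging read \
        'timestamp>="2024-01-20T00:00:00Z" AND timestamp<="2024-01-21T06:00:00Z"' \
        --project=YOUR_PROJECT_ID \
        --order=desc \
        --limit=100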

For your next attempts, though, I would suggest disabling idle shutdown in your runtime template, as this option is enabled by default:

(Screenshot: the idle shutdown option in the runtime template settings)

All runtimes created from this template will have idle shutdown disabled once this option is unticked.
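
If you prefer to create the template programmatically instead of through the console, a call along these lines against the Vertex AI REST API should have the same effect as unticking the box. Treat it as a sketch only: the region, project ID, and machine type are placeholders, and I am quoting the idleShutdownConfig / idleShutdownDisabled field names from memory, so verify them against the current API reference:

    curl -X POST \
        -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        -H "Content-Type: application/json" \
        "https://us-central1-aiplatform.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/us-central1/notebookRuntimeTemplates" \
        -d '{
              "displayName": "no-idle-shutdown-template",
              "machineSpec": {"machineType": "n1-standard-8"},
              "idleShutdownConfig": {"idleShutdownDisabled": true}
            }'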

 

 

I am having a similar problem.

I subscribed to Google Colab Pro+ to train a seq2seq model with an attention mechanism for an NLP project. However, after almost an hour the runtime changes to "connecting..." and "Waiting to finish the current execution" appears at the bottom. I am frustrated and worried about losing my variables, because I have already used a lot of my GPU units and the training has been going on for more than 6 hours.

So will there be any problem retaining the variables, especially the training history variable?
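
In the meantime, the workaround I am considering is to periodically dump the history (and a model checkpoint) to Drive from inside the training loop, so a disconnect cannot wipe it completely. This is only a rough, untested sketch; the path and the history object are placeholders for whatever my loop actually accumulates:

    import pickle
    from pathlib import Path

    HISTORY_PATH = Path("/content/drive/MyDrive/seq2seq/history.pkl")

    def save_history(history):
        # Persist the accumulated losses/metrics so a disconnect doesn't lose them.
        HISTORY_PATH.parent.mkdir(parents=True, exist_ok=True)
        with open(HISTORY_PATH, "wb") as f:
            pickle.dump(history, f)

    # Called every few epochs inside the training loop, e.g.:
    # if epoch % 5 == 0:
    #     save_history(history)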
