
Monitoring training progress and reviewing results in Colab Enterprise

Hi, I am trying to train an EfficientViT model from the GitHub repository at https://github.com/mit-han-lab/efficientvit.git. I cloned the repository, pip installed all the necessary modules, and imported them. Then I ran this command to start training the model, as described in the project's TRAINING.md:

          ! torchpack dist-run -np 8 \
          python efficientvit/train_cls_model.py efficientvit/configs/cls/imagenet/b1.yaml \
          --data_provider.image_size "[128,160,192,224,256,288]" \
          --run_config.eval_image_size "[288]" \
          --path ./exp/cls/imagenet/b1_r288/

After submitting this command, I saw "Waiting for the current execution to complete" and no other output for the next 4 hours. Then I went to bed, and the next morning the runtime appeared to have terminated.

Typically, as training progresses, status information is printed out, but I didn't see anything printed. I also don't know what the result of the training run was. What can I do to rectify this? Any help is appreciated.
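
For what it is worth, one change I am considering for the next attempt is to tee all output to a log file (ideally somewhere persistent, such as a mounted Cloud Storage bucket) so that there is at least something to inspect afterwards. This is an untested sketch and the log filename is arbitrary:

    ! torchpack dist-run -np 8 \
    python efficientvit/train_cls_model.py efficientvit/configs/cls/imagenet/b1.yaml \
    --data_provider.image_size "[128,160,192,224,256,288]" \
    --run_config.eval_image_size "[288]" \
    --path ./exp/cls/imagenet/b1_r288/ 2>&1 | tee train_b1_r288.log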

2 REPLIES

I believe the error only displays during runtime. You can try going to Cloud Logging, querying the resource, and narrowing the date range down to when it happened.
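
If it helps, something along these lines from Cloud Shell pulls the log entries for the window in which the runtime stopped. The timestamps and project ID are placeholders, and you can tighten the filter further once Logs Explorer shows you which resource type your runtime logs under:

    gcloud logging read \
        'timestamp>="2024-01-20T00:00:00Z" AND timestamp<="2024-01-21T06:00:00Z"' \
        --project=YOUR_PROJECT_ID \
        --order=desc \
        --limit=100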

For your next attempts, though, I would suggest disabling idle shutdown in your runtime template, as this option is enabled by default:

(Screenshot: the idle shutdown option in the runtime template settings)

All runtimes created from this template will have idle shutdown disabled once this option is unticked.
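
If you prefer to create the template programmatically instead of through the console, a call along these lines against the Vertex AI REST API should have the same effect as unticking the box. Treat it as a sketch only: the region, project ID, and machine type are placeholders, and I am quoting the idleShutdownConfig / idleShutdownDisabled field names from memory, so verify them against the current API reference:

    curl -X POST \
        -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        -H "Content-Type: application/json" \
        "https://us-central1-aiplatform.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/us-central1/notebookRuntimeTemplates" \
        -d '{
              "displayName": "no-idle-shutdown-template",
              "machineSpec": {"machineType": "n1-standard-8"},
              "idleShutdownConfig": {"idleShutdownDisabled": true}
            }'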

 

 

I am having a similar problem.

I subscribed to Google Colab Pro+ to train a seq2seq model with an attention mechanism for an NLP project. However, after almost an hour the runtime changes to "connecting..." and "Waiting to finish the current execution" appears at the bottom. I am frustrated and worried about losing my variables, because I have already used a lot of my GPU units and the training has been going on for more than 6 hours.

So will there be any problem retaining the variables, especially the training history variable?
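
In the meantime, the workaround I am considering is to periodically dump the history (and a model checkpoint) to Drive from inside the training loop, so a disconnect cannot wipe it completely. This is only a rough, untested sketch; the path and the history object are placeholders for whatever my loop actually accumulates:

    import pickle
    from pathlib import Path

    HISTORY_PATH = Path("/content/drive/MyDrive/seq2seq/history.pkl")

    def save_history(history):
        # Persist the accumulated losses/metrics so a disconnect doesn't lose them.
        HISTORY_PATH.parent.mkdir(parents=True, exist_ok=True)
        with open(HISTORY_PATH, "wb") as f:
            pickle.dump(history, f)

    # Called every few epochs inside the training loop, e.g.:
    # if epoch % 5 == 0:
    #     save_history(history)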
