Hi, I am trying to train an efficientvit model from the GitHub repository at https://github.com/mit-han-lab/efficientvit.git. I cloned the repository, pip installed all the necessary modules, and imported them. Then I ran this command to start training the model, as described in the project's TRAINING.md:
After submitting this command, I saw "Waiting for the current execution to complete" and no other output for the next 4 hours. Then I went to bed, and by the next morning the runtime appeared to have terminated.
Typically, status info is printed as training progresses, but I didn't see anything printed. I also don't know what the result of the training run was. What can I do to fix this? Any help is appreciated.
I believe the error is only displayed while the runtime is active. You can try going to Cloud Logging, querying the resource, and narrowing the time range to when the incident happened.
For your next attempts, though, I would suggest disabling idle shutdown in the runtime template, as this option is enabled by default:
Once this option is unticked, all runtimes created from this template will have no idle shutdown.
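Since no status output was visible and the result of the run is unknown, it may also help to write the training output to a log file as it arrives, so there is something to inspect after the runtime stops. Below is a minimal Python sketch, not taken from the efficientvit repository. `run_and_log` and the log path are hypothetical names, and the command list stands in for whatever command TRAINING.md gives. It assumes the log file lives on storage that outlives the runtime.

```python
import subprocess
import sys


def run_and_log(cmd, log_path):
    """Run cmd, echo each output line, and append it to log_path as it arrives.

    Hypothetical helper for illustration; returns the command's exit code.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so tracebacks are logged too
            text=True,
            bufsize=1,  # line-buffered on the reading side of the pipe
        )
        for line in proc.stdout:
            sys.stdout.write(line)  # still show progress in the notebook
            log.write(line)
            log.flush()  # push each line to disk immediately
        return proc.wait()


# Example call (placeholders, replace with the real command and a persistent path):
# run_and_log([sys.executable, "-u", "your_training_script.py"], "train.log")
```

The child process may buffer its own output. For a Python training script, passing `-u` as in the commented example asks the interpreter not to buffer stdout, so lines reach the log as they are printed.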
I am having a similar problem.
I subscribed to Google Colab Pro+ to train a seq2seq model with an attention mechanism for an NLP project. However, after almost an hour the runtime status changes to "Connecting", and "Waiting to finish the current execution" appears at the bottom. I am frustrated and worried about losing my variables, because I have already consumed a lot of my GPU units: the training has been running for more than 6 hours.
So, is there any problem with retaining the variables, especially the training history variable?
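As far as I know, variables live only in the runtime's memory. If the session reconnects to the same running runtime they should still be there, but once the runtime is terminated they are gone. One way to reduce the risk is to write the training history to disk after every epoch. Below is a minimal sketch using only the Python standard library. `save_history`, `load_history`, the file name, and the `loss`/`val_loss` keys are all hypothetical, and it assumes the history can be represented as a dict of plain lists.

```python
import json
import os


def save_history(history, path):
    """Write the history dict to path, replacing any earlier copy atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(history, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before renaming
    os.replace(tmp, path)  # a crash mid-write leaves the previous file intact


def load_history(path):
    """Load a saved history, or start an empty one if no file exists yet."""
    if not os.path.exists(path):
        return {"loss": [], "val_loss": []}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Inside a training loop (placeholders for your own loss values):
# history = load_history("history.json")
# history["loss"].append(float(epoch_loss))
# history["val_loss"].append(float(epoch_val_loss))
# save_history(history, "history.json")
```

For this to survive a terminated runtime, the path has to point at storage that persists outside the runtime. Model weights would need to be checkpointed separately with your framework's own save function.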