Hello everyone!
I am trying to create a custom training job in Vertex AI.
I can successfully run gcloud ai custom-jobs create and specify machine and GPU types, a custom container, etc. The job starts successfully. However, I haven't figured out how to specify a checkpoint directory within my bucket for saving my model during training. Inside the training script, os.getenv('AIP_MODEL_DIR') returns None when no output directory is set.
When using the console, there is an option to select 'Model output directory'. Do you know how I can specify this with the gcloud ai custom-jobs create command in the terminal? I think it should be the staging_bucket argument in the CustomTrainingJob class and/or baseOutputDirectory in CustomJobSpec.
Cheers!
Hi @MaryAzr,
Welcome to Google Cloud Community!
You are correct: you need to specify baseOutputDirectory within the CustomJobSpec when using gcloud ai custom-jobs create. This defines the Cloud Storage location where your model checkpoints and other training artifacts will be saved.
Here are the steps to specify an output directory with the gcloud ai custom-jobs create command in the terminal:
1. Define your CustomJobSpec in a config file (for example, your-custom-job-spec.json):
{
  "workerPoolSpecs": [
    {
      "machineSpec": {
        "machineType": "n1-standard-4"
      },
      "replicaCount": 1,
      "containerSpec": {
        "imageUri": "your-container-image-uri"
      }
    }
  ],
  "baseOutputDirectory": {
    "outputUriPrefix": "gs://your-bucket-name/your-output-directory"
  }
}
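If you prefer to generate the config file rather than write it by hand, here is a minimal Python sketch. The helper names build_custom_job_spec and write_custom_job_spec are hypothetical, and the image URI, bucket path and machine type are placeholders. It assumes a single worker pool running a custom container and relies on JSON being accepted by --config because JSON is valid YAML.

```python
import json
from pathlib import Path


def build_custom_job_spec(image_uri, output_uri_prefix, machine_type="n1-standard-4"):
    """Return a CustomJobSpec dict for a single-replica custom-container job.

    Hypothetical helper: field names follow the CustomJobSpec schema
    (workerPoolSpecs, containerSpec.imageUri, baseOutputDirectory.outputUriPrefix).
    """
    return {
        "workerPoolSpecs": [
            {
                "machineSpec": {"machineType": machine_type},
                "replicaCount": 1,
                "containerSpec": {"imageUri": image_uri},
            }
        ],
        # This is the field that makes Vertex AI set AIP_MODEL_DIR and friends.
        "baseOutputDirectory": {"outputUriPrefix": output_uri_prefix},
    }


def write_custom_job_spec(path, spec):
    """Write the spec as JSON and return the path that was written."""
    target = Path(path)
    target.write_text(json.dumps(spec, indent=2) + "\n")
    return target


if __name__ == "__main__":
    spec = build_custom_job_spec(
        image_uri="your-container-image-uri",  # placeholder
        output_uri_prefix="gs://your-bucket-name/your-output-directory",  # placeholder
    )
    print(write_custom_job_spec("your-custom-job-spec.json", spec))
```

The resulting file can then be passed to gcloud with --config.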
2. Pass the config file to gcloud ai custom-jobs create:
gcloud ai custom-jobs create \
--region=us-central1 \
--display-name="your-job-name" \
--config=your-custom-job-spec.json
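One important note: once baseOutputDirectory is set, Vertex AI sets the AIP_MODEL_DIR, AIP_CHECKPOINT_DIR and AIP_TENSORBOARD_LOG_DIR environment variables inside the training container. They point to the model/, checkpoints/ and logs/ subdirectories under the output prefix you specified.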
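On the training-script side, the output location can then be read from the environment. Below is a minimal sketch, assuming the job was created with baseOutputDirectory set and that the script writes through the Cloud Storage FUSE mount that Vertex AI exposes under /gcs/. The function name resolve_output_dir and the local fallback paths are hypothetical.

```python
import os


def resolve_output_dir(env_var="AIP_MODEL_DIR", default="./model_output"):
    """Return a path the training script can write to.

    Reads the gs:// URI that Vertex AI places in env_var and maps it to
    the /gcs/ FUSE mount. Falls back to `default` when the variable is
    unset, e.g. when running locally or without baseOutputDirectory.
    """
    uri = os.getenv(env_var)
    if not uri:
        return default
    if uri.startswith("gs://"):
        # gs://bucket/path -> /gcs/bucket/path
        return "/gcs/" + uri[len("gs://"):]
    return uri


if __name__ == "__main__":
    checkpoint_dir = resolve_output_dir("AIP_CHECKPOINT_DIR", "./checkpoints")
    model_dir = resolve_output_dir("AIP_MODEL_DIR", "./model")
    print("checkpoints ->", checkpoint_dir)
    print("final model ->", model_dir)
```

If your framework can write to gs:// URIs directly, you can skip the /gcs/ mapping and use the environment variable value as-is.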
I hope the above information is helpful.