Re: How to specify staging bucket with gcloud ai c...

MaryAzr · 11-02-2024 04:25 AM

Hello everyone!

I am trying to create a custom training job in Vertex AI.

I can successfully use gcloud ai custom-jobs create and specify machine and GPU types, a custom container, etc. The job starts successfully. However, I haven't figured out how I can specify a checkpoint directory within my bucket for saving my model while training. Inside the training script, os.getenv('AIP_MODEL_DIR') returns nothing when no output directory is set.

When using the console, there is an option to select 'Model output directory'. Do you know how I can specify this with the gcloud ai custom-jobs command in the terminal? I think it should be the staging_bucket argument in the CustomTrainingJob class and/or the baseOutputDirectory in CustomJobSpec?

Cheers!

MJane

Hi @MaryAzr ,

Welcome to Google Cloud Community!

You are correct: you need to specify baseOutputDirectory within the CustomJobSpec when using gcloud ai custom-jobs create. This defines the Cloud Storage location where your model checkpoints and other training artifacts are saved.

Here are steps that might help you specify a checkpoint directory with the gcloud ai custom-jobs command in the terminal:

1. Define your CustomJobSpec in a config file (for example, your-custom-job-spec.json):

{
  "workerPoolSpecs": [
    {
      "machineSpec": {
        "machineType": "n1-standard-4"
      },
      "replicaCount": 1,
      "containerSpec": {
        "imageUri": "your-container-image-uri"
      }
    }
  ],
  "baseOutputDirectory": {
    "outputUriPrefix": "gs://your-bucket-name/your-output-directory"
  }
}

2. Pass the config file to gcloud ai custom-jobs create:

gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name="your-job-name" \
  --config=your-custom-job-spec.json
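If you would rather generate the config file than write it by hand, the two steps above can be sketched in Python. This is only a sketch under assumptions: it follows the CustomJobSpec layout as I understand it from the Vertex AI REST reference (workerPoolSpecs at the top level, a containerSpec with imageUri for a custom container, and baseOutputDirectory as an object with outputUriPrefix). The helper names build_custom_job_spec and write_spec are hypothetical, and the image URI, bucket path, and machine type are placeholders.

```python
from __future__ import annotations

import json
from pathlib import Path


def build_custom_job_spec(image_uri: str, output_uri_prefix: str,
                          machine_type: str = "n1-standard-4") -> dict:
    """Return a CustomJobSpec dict for a single-replica custom-container job."""
    return {
        "workerPoolSpecs": [
            {
                "machineSpec": {"machineType": machine_type},
                "replicaCount": 1,
                # Custom container, as in the original question.
                "containerSpec": {"imageUri": image_uri},
            }
        ],
        # Assumed layout: a Cloud Storage destination object, not a bare string.
        "baseOutputDirectory": {"outputUriPrefix": output_uri_prefix},
    }


def write_spec(spec: dict, path: str) -> list[str]:
    """Write the spec as JSON to `path` and return the matching gcloud command."""
    Path(path).write_text(json.dumps(spec, indent=2))
    return [
        "gcloud", "ai", "custom-jobs", "create",
        "--region=us-central1",
        "--display-name=your-job-name",
        f"--config={path}",
    ]


if __name__ == "__main__":
    spec = build_custom_job_spec(
        image_uri="your-container-image-uri",
        output_uri_prefix="gs://your-bucket-name/your-output-directory",
    )
    cmd = write_spec(spec, "your-custom-job-spec.json")
    # Print the command instead of running it, so nothing is submitted by accident.
    print(" ".join(cmd))
```

The script only writes the file and prints the command; you still run gcloud yourself once the spec looks right.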

Here are some important notes to remember:

Permissions - Make sure that the service account the job runs as has permission to write to the specified Google Cloud Storage bucket.
Environment variables - When baseOutputDirectory is set, Vertex AI sets 'AIP_MODEL_DIR' to the model/ subdirectory of that location, 'AIP_CHECKPOINT_DIR' to the checkpoints/ subdirectory, and 'AIP_TENSORBOARD_LOG_DIR' to the logs/ subdirectory. Your training script can read these variables to find where to save the final model, checkpoints, and logs.
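On the training-script side, reading these variables might look like the sketch below. It assumes Vertex AI populates AIP_MODEL_DIR and AIP_CHECKPOINT_DIR with gs:// URIs once an output directory is configured, and that the bucket is also reachable through the Cloud Storage FUSE mount under /gcs/ inside the training container. The helper names resolve_output_dirs and to_fuse_path and the ./output fallback are hypothetical, chosen only for illustration.

```python
import os


def resolve_output_dirs(env=os.environ, fallback="./output"):
    """Return (model_dir, checkpoint_dir), preferring the Vertex AI variables.

    Falls back to local directories so the script also runs outside Vertex AI.
    """
    base = fallback.rstrip("/")
    model_dir = env.get("AIP_MODEL_DIR") or f"{base}/model"
    checkpoint_dir = env.get("AIP_CHECKPOINT_DIR") or f"{base}/checkpoints"
    return model_dir, checkpoint_dir


def to_fuse_path(uri):
    """Map gs://bucket/path to /gcs/bucket/path; leave other paths unchanged.

    Assumption: the bucket is mounted via Cloud Storage FUSE, so ordinary
    file APIs can write to it. GCS-aware libraries can use the gs:// URI directly.
    """
    prefix = "gs://"
    if uri.startswith(prefix):
        return "/gcs/" + uri[len(prefix):]
    return uri


if __name__ == "__main__":
    model_dir, checkpoint_dir = resolve_output_dirs()
    print(f"final model -> {to_fuse_path(model_dir)}")
    print(f"checkpoints -> {to_fuse_path(checkpoint_dir)}")
```

The local fallback keeps the same script usable for a quick test run on your own machine, where the AIP_* variables are not set.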

I hope the above information is helpful.
