
Batch - using instance templates

Hello,

We're trying to set up GCP Batch processing with a custom instance template, but we've hit an issue.

The problem is that when the VM boots up, it stays idle for about 20 minutes and then gets shut down. I described the job with 'gcloud' and noticed it says:

- description: Job state is set from QUEUED to SCHEDULED for job projects/12345/locations/us-central1/jobs/test-5.
    eventTime: '2023-03-01T03:25:36.379504559Z'
    type: STATUS_CHANGED
- description: no VM has agent reporting correctly within the time window 1080 seconds.
      VM state for instance j-4a6fec51-bd73-40c3-87cc-b67614f4d882-group0-0-2vgv is
      2023/03/01-03:26:38+0000,startup,51,unsupported_cos.
    eventTime: '2023-03-01T03:46:35.801343563Z'
    type: OPERATIONAL_INFO
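
For reference, those status events are from describing the Batch job itself, along the lines of the command below (project and job names are placeholders here, not our real ones):

```
# Describe the Batch job to see its status events (placeholder names).
gcloud batch jobs describe test-5 \
  --location=us-central1 \
  --project=my-project \
  --format=yaml
```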

I confirmed the permissions are OK; the service account we use has the `Batch Agent Reporter` role attached.

I also checked for networking issues: the network tester says packets are flowing just as expected, and we have Cloud NAT set up in the project as well.

I found a Stack Overflow thread about the specific type of image Batch uses when spinning up VMs for "script" jobs, but our job needs access to:
1. a Shared VPC
2. specific services we run
We'd like to bundle our code into a Docker image, create our own flavour of instance template, and then run it as a Batch job.
The SO thread talks about `batch-cos-stable-official`, but I cannot find a VM image even remotely close to that specific name.
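
For context, this is roughly the shape of what we're submitting; the project, repository, template, and job names below are simplified placeholders, not our real ones:

```
# Sketch of the job config we submit (all names are placeholders).
cat > job.json <<'EOF'
{
  "taskGroups": [{
    "taskCount": 1,
    "taskSpec": {
      "runnables": [{
        "container": {
          "imageUri": "us-central1-docker.pkg.dev/my-project/my-repo/my-image:latest"
        }
      }]
    }
  }],
  "allocationPolicy": {
    "instances": [{ "instanceTemplate": "my-batch-template" }]
  },
  "logsPolicy": { "destination": "CLOUD_LOGGING" }
}
EOF

gcloud batch jobs submit test-5 \
  --location=us-central1 \
  --project=my-project \
  --config=job.json
```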

Could you please advise how we should proceed with setting up our instance template or how to get access to batch-specific VM images?

 

Thanks in advance for your support

Regards,
Mikolaj M.


Hi @mmett ,

Were you able to run diagnostics and query the Cloud Audit Logs?

Hey LarryNic,

Not sure what diagnostics you're referring to.

The audit logs aren't showing anything besides inserting (creating?) the VM, then deleting it.
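
For what it's worth, this is roughly the query I ran against the audit logs (the project ID is a placeholder):

```
# List recent Compute Engine audit log entries for the project (placeholder project ID).
gcloud logging read \
  'logName:"cloudaudit.googleapis.com" AND resource.type="gce_instance"' \
  --project=my-project \
  --freshness=1d \
  --limit=50
```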

Hi @mmett ,

Thanks for reaching out.

Did you have Cloud Logging enabled for the job? If so, did you see a more detailed error message from Cloud Logging for the job? (You can reach the Cloud Logging entries for the job from the job detail page in the console.) This needs roles/logging.logWriter on your service account.
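
For example, if I recall the log name correctly, something like this should show the task logs for Batch jobs in the project (the project ID is a placeholder):

```
# Read Batch task logs for the project (placeholder project ID).
gcloud logging read \
  'logName="projects/my-project/logs/batch_task_logs"' \
  --project=my-project \
  --limit=50
```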

Multiple reasons could cause the issue. 

  • Does your service account have permission to pull containers? If you use Artifact Registry it needs roles/artifactregistry.reader, and if you use Container Registry it needs roles/storage.objectViewer (see the example command after this list).
  • Which VM image do you use in your instance template? The image may not be compatible with Batch, but generally Debian, Container-Optimized OS, or CentOS 7 based images should work.
  • Does your Shared VPC setup allow the VM to access the internet, or at least Google APIs?
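
For the first point, granting the registry read permission looks roughly like this (the service account and project names are placeholders):

```
# Grant Artifact Registry read access to the job's service account (placeholder names).
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:my-batch-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/artifactregistry.reader"
```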

Hey @bolianyin 

Yes, we do have 'roles/logging.logWriter' set on the service account. For whatever twisted reason, some of the logs are not visible. I'll double-check the log sink config, but IIRC we temporarily removed that configuration in the project we're trying to run Batch jobs in.
The curious thing is that when I click the "job logs" button in the job itself, it does not show the logs; only when I remove the 'logName' query filter do some logs become visible.

The service account also has the 'roles/storage.objectViewer' role set, so I'm sure it is not a problem with fetching the container.

Unfortunately, the VM image in the instance template is not a COS one. Somehow, I don't see the batch-cos images in the dropdown when creating/modifying the instance template. I guess that once we get Batch to use its specific VM images, it will work just fine.

The Shared VPC allows access to Google APIs at the very least (as in, we are able to access Bigtable/BigQuery/other GCP services), and we have Cloud NAT set up on the VPC too - I'm guessing that gives us access to the internet as well.

To confirm, our `PROJECT_ID-(PII Removed by staff)` service account has the following roles attached (verified with the command shown after the list):

- Batch Agent Reporter
- Logs Writer
- Service Account User
- Storage Admin
- Storage Object Admin
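
This is the check I ran (the service account and project names are placeholders here):

```
# List the roles bound to our Batch service account (placeholder names).
gcloud projects get-iam-policy my-project \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:my-batch-sa@my-project.iam.gserviceaccount.com" \
  --format="table(bindings.role)"
```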

One more update: a recent try failed with the previously mentioned reason (the Batch Agent not being able to report back to the Batch service).
Details of the last configuration used: I submitted a job without specifying an instance template, leaving that part to be handled by the Batch service, but I still provided the URI to our container image.

I still have the VM page open and can see that it used the 'batch-cos-stable-official-20230215-01-p00' image, so that's good, but... it looks like the issue might have something to do with network connectivity after all. I'll dig more into the networking configuration.

OTOH, I'd like to learn how to create and use a custom instance template to spin up a more tailored VM for Batch workloads.
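
In case it helps, this is the kind of instance template I was hoping to create once the Batch images become visible to us; note that the image project name below is purely my guess, and the subnet path is a placeholder for our Shared VPC subnet:

```
# Rough sketch of the instance template I'd like to use for Batch
# (image project is my guess; host project and subnet names are placeholders).
gcloud compute instance-templates create my-batch-template \
  --machine-type=e2-standard-4 \
  --image-family=batch-cos-stable-official \
  --image-project=batch-custom-image \
  --subnet=projects/my-host-project/regions/us-central1/subnetworks/my-subnet \
  --no-address
```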

>> only when I remove the 'logName' query filter, some logs are visible.

Interesting. Do you see anything useful there? Another logName filter you can try is "batch_agent_logs", which logs activity of the Batch Agent running on your VM.
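
For example (the project ID is a placeholder):

```
# Read the Batch Agent logs for the project (placeholder project ID).
gcloud logging read \
  'logName="projects/my-project/logs/batch_agent_logs"' \
  --project=my-project \
  --limit=50
```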

>> Shared VPC allows access to Google APIs at the very least (as in we are able to access BigTable/BigQuery/otherGCPservices), 

That should be enough to work with Batch, so it might be an issue other than the network.

>> OTOH, I'd like to get to know how to create and use a custom instance template to spin up a more tailored VM for Batch loads.

We are working on the details of creating custom VMs compatible with the Batch service. We should have public documentation once it is ready. Stay tuned.

If you post your job UID (available from the UI or when you do 'gcloud batch jobs describe ...'), we may be able to investigate more on the service side.