
Batch issues; newbie question

Is it sufficient for a Batch job's service account to have all of the permissions contained in the batch.agentReporter role, or must the batch.agentReporter role itself be assigned to it?

I'm trying to troubleshoot the following error:
description: no VM has agent reporting correctly within the time window 1080 seconds.

I'm not using an external IP address, so I think I can rule out network issues.

1 ACCEPTED SOLUTION

Howdy prinehart and welcome to the community!!!

Your description matches this link, which makes it clear why you are asking about permissions and mentioning the network.  Let's get the permissions part out of the way first.  In Google Cloud, we grant "roles" to a principal.  Roles are named collections of permissions.  Google is saying that the service account must have the role roles/batch.agentReporter.  When we look up that role, we find this link, which says that the role (currently) contains just a single permission called batch.states.report.  So what this tells me is that your service account needs one of the following:

  • The role roles/batch.agentReporter
  • A custom role that includes batch.states.report

There should be no need to grant roles/batch.agentReporter itself if the service account already has a custom role that contains the required permission.
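To make the "role versus permission" point concrete, here is a minimal sketch of how I think about it.  It does not call any Google Cloud API; the role table is a hand-written, abbreviated illustration, and the custom role name (customBatchRunner) and project name are made up.  Only the roles/batch.agentReporter entry comes from the discussion above.

```python
# Abbreviated, hand-written role -> permissions table (an assumption for
# illustration; real roles are looked up in IAM, not hard-coded).
ROLE_PERMISSIONS = {
    "roles/batch.agentReporter": {"batch.states.report"},
    # Hypothetical custom role that happens to include the needed permission.
    "projects/my-project/roles/customBatchRunner": {
        "batch.states.report",
        "logging.logEntries.create",
    },
    # A role that does NOT include the needed permission (listing shortened).
    "roles/logging.logWriter": {"logging.logEntries.create"},
}

REQUIRED_PERMISSION = "batch.states.report"


def effective_permissions(granted_roles):
    """Return the union of permissions carried by every granted role."""
    permissions = set()
    for role in granted_roles:
        # Roles missing from the table contribute nothing in this sketch.
        permissions |= ROLE_PERMISSIONS.get(role, set())
    return permissions


def can_report_agent_state(granted_roles):
    """True if any granted role carries the permission the agent needs."""
    return REQUIRED_PERMISSION in effective_permissions(granted_roles)
```

What matters is the permission, not which role delivered it: can_report_agent_state returns True for either the predefined role or the hypothetical custom role, and False for a service account that only has roles/logging.logWriter.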

The other possibility described in the article is a network issue.  Let's think through what that means.  When you run a Batch job, Google spins up a Compute Engine VM and runs your job in a container on that VM.  When the job ends, the VM is destroyed.  My understanding is that the VM is (by default) configured with an external IP address.  This allows the VM to "call back" to the Batch service to report the outcome and status of the work being performed.  I also note in the docs that you can explicitly disable external IP addresses, BUT if you do that, then you MUST also configure Cloud NAT and enable Private Google Access.
Maybe post your JSON config file (obfuscating anything you consider sensitive).  I'd also suggest trawling through your Cloud Logging logs from the time you ran the Batch job to see whether there are additional log entries that can help with the diagnosis.
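As a rough aid when looking at your own config, here is a small sketch that inspects a job config (loaded as a Python dict) and flags interfaces that disable external IP addresses.  I'm assuming the config uses the allocationPolicy.network.networkInterfaces[].noExternalIpAddress field from the Batch job resource; the project, network, and subnet names below are placeholders, not values from this thread.

```python
def interfaces_without_external_ip(job_config):
    """Return the network interfaces that set noExternalIpAddress to true."""
    network = job_config.get("allocationPolicy", {}).get("network", {})
    interfaces = network.get("networkInterfaces", [])
    return [nic for nic in interfaces if nic.get("noExternalIpAddress", False)]


def needs_outbound_path(job_config):
    """True if the job's VMs will have no external IP address."""
    return len(interfaces_without_external_ip(job_config)) > 0


# Placeholder job config fragment; all resource names are made up.
example_job = {
    "allocationPolicy": {
        "network": {
            "networkInterfaces": [
                {
                    "network": "projects/my-project/global/networks/my-vpc",
                    "subnetwork": (
                        "projects/my-project/regions/us-central1/"
                        "subnetworks/my-subnet"
                    ),
                    "noExternalIpAddress": True,
                }
            ]
        }
    }
}

if needs_outbound_path(example_job):
    print(
        "VMs have no external IP: check that the subnet gives them a way "
        "to reach the Batch service (for example Cloud NAT)."
    )
```

If this flags your job and nothing on the network side gives the VMs a route out, the agent cannot report back, which would be consistent with the "no VM has agent reporting correctly" error.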


The missing Cloud NAT service was the culprit.