
"Internal error running task" in Cloud Run Jobs

Hey there,

We've been using Cloud Run Jobs for the last couple of months and it's been great. Even though we knew it was still in preview, Cloud Run Jobs fit our use case so well that we've built things that depend tightly on it, and we plan to build more.

However, since last weekend we have observed a significant number of failed tasks with the message "Internal error running task". It also seems those tasks never started: there is no "Start time", and we found zero logs for them.

Here is an example of such a job. All failed tasks here show only "Internal error running task" and nothing else.

(Screenshot attached: Screenshot 2022-11-14 121918.png)

Note that the above job is still running at the time of writing. It's been running for over 7 hours. In this particular job, a successful task typically completes in under one minute. We set the timeout to 10 minutes and parallelism to 50.
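
For reference, this is roughly how such a job is wired up with the Python client (google-cloud-run / run_v2); the project, region, job name, and image below are placeholders rather than our real values:

    import datetime

    from google.cloud import run_v2

    # Placeholders -- not our real project, region, job name or image.
    PARENT = "projects/my-project/locations/europe-west1"

    client = run_v2.JobsClient()

    job = run_v2.Job(
        template=run_v2.ExecutionTemplate(
            task_count=8000,   # one task per work item
            parallelism=50,    # at most 50 tasks run at the same time
            template=run_v2.TaskTemplate(
                containers=[run_v2.Container(image="gcr.io/my-project/worker:latest")],
                timeout=datetime.timedelta(minutes=10),  # per-task timeout
                max_retries=3,
            ),
        ),
    )

    # Create the job, then kick off an execution (run_job returns a long-running
    # operation; we don't wait for the whole execution to finish here).
    client.create_job(parent=PARENT, job=job, job_id="my-batch-job").result()
    client.run_job(name=f"{PARENT}/jobs/my-batch-job")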

Also, we noticed that this happened on jobs with a larger number of tasks (> 8k). Our jobs with fewer tasks have finished successfully, even though they ran noticeably slower compared to previous weeks.
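
Something like the sketch below can be used to count the affected tasks (Python client again; the execution name is a placeholder, and it assumes a task that never started simply has start_time unset, which matches what the console shows):

    from google.cloud import run_v2

    # Placeholder execution name -- not a real one.
    EXECUTION = (
        "projects/my-project/locations/europe-west1"
        "/jobs/my-batch-job/executions/my-batch-job-abc12"
    )

    tasks_client = run_v2.TasksClient()

    never_started = []
    for task in tasks_client.list_tasks(parent=EXECUTION):
        # Tasks failing with "Internal error running task" show no start time and
        # no logs; treat an unset start_time as "never started".
        if "start_time" not in task:
            never_started.append(task.index)

    print(f"{len(never_started)} task(s) never started, e.g. indexes {never_started[:10]}")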

Does anyone know what's happening here?

Thanks!

9 REPLIES

Hi,

We have exactly the same problem here.
Our two Cloud Run Jobs haven't worked since last Friday.
When we execute them, they go into a pending status; after ten minutes they fail with errors but no logs.
The execution YAML shows the following message under "status": "Failed to start execution deployment. Deadline exceeded."
Both jobs have a Serverless VPC connector attached, which worked fine until last week.
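
The same status block can presumably also be read with the Python client, if anyone wants to script the check (the execution name below is just a placeholder):

    from google.cloud import run_v2

    # Placeholder execution name.
    EXECUTION = (
        "projects/my-vpc-project/locations/europe-west3"
        "/jobs/my-job/executions/my-job-xyz99"
    )

    client = run_v2.ExecutionsClient()
    execution = client.get_execution(name=EXECUTION)

    # The conditions carry the same information as the "status" block in the YAML,
    # e.g. "Failed to start execution deployment. Deadline exceeded."
    for condition in execution.conditions:
        print(condition.type_, condition.state, condition.message)

    print("succeeded:", execution.succeeded_count, "failed:", execution.failed_count)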

Thanks
Alex

Hey aleno, 

We have the same setup, Cloud Run integrated with a serverless VPC connector, and our Cloud Run jobs worked fine before 12 pm today. After that, all the jobs failed with no logs. Any update on this? We would like to keep the serverless VPC connector in place.

Hi mulugu1997,
I solved the issue yesterday.
It's pretty confusing and strange, but in our case it was a permission problem.
I can't say whether there were multiple problems at once, but this one was the solution for us:

- In the IAM page of the VPC host project, check the Google-managed service account <PII removed by staff>
- This service account must have the role "Serverless VPC Access Service Agent"

Strangely, this was not the case in our VPC host project, even though everything had worked perfectly before.
I checked the project activity page but found no record of anyone removing this permission.
Very confusing situation... but now everything works again.
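
If it helps, the binding can be checked programmatically along these lines with the Resource Manager Python client. The project ID and the service agent address are placeholders (the real address is the redacted account above), and I'm assuming the role ID behind "Serverless VPC Access Service Agent" is roles/vpcaccess.serviceAgent:

    from google.cloud import resourcemanager_v3

    # Placeholders -- the VPC host project and the Google-managed Serverless VPC
    # Access service agent (its real address is the redacted account above).
    HOST_PROJECT = "projects/my-vpc-host-project"
    AGENT = "serviceAccount:<serverless-vpc-access-service-agent>"
    ROLE = "roles/vpcaccess.serviceAgent"  # assumed ID of the service agent role

    client = resourcemanager_v3.ProjectsClient()
    policy = client.get_iam_policy(request={"resource": HOST_PROJECT})

    members = set()
    for binding in policy.bindings:
        if binding.role == ROLE:
            members.update(binding.members)

    if AGENT in members:
        print("Service agent still holds the Serverless VPC Access Service Agent role.")
    else:
        print("Role binding is missing -- re-grant it on the VPC host project.")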

Hope that helps
Alex

Thank you for bringing this up; we're looking into this issue. 

Hi knet,

Thanks for your reply.
My guess is that something is broken between Cloud Run Jobs and the Serverless VPC connector.
Running a test container (hello world...) in the same region (europe-west3) with the same job parameters, but without a Serverless VPC connector, seems to work fine.
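
Concretely, the only difference between the two runs is whether the task template carries a vpc_access block. A rough sketch with the Python client; the connector path and images are placeholders:

    from google.cloud import run_v2

    # Our normal jobs: same parameters, but routed through the Serverless VPC
    # connector (connector path is a placeholder).
    with_connector = run_v2.TaskTemplate(
        containers=[run_v2.Container(image="gcr.io/my-project/worker:latest")],
        vpc_access=run_v2.VpcAccess(
            connector="projects/my-project/locations/europe-west3/connectors/my-connector",
        ),
    )

    # Control test: identical template without any vpc_access -- this one runs fine.
    hello_world = run_v2.TaskTemplate(
        containers=[run_v2.Container(image="us-docker.pkg.dev/cloudrun/container/job")],
    )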

Hey knet,

Is there any update on this? Unlike aleno's, our setup does not use a serverless VPC connector.

In the meantime, we have set up a Kubernetes cluster and migrated our workflow to Kubernetes Jobs.

To add to the above, my suggestion for both cases is to file an issue in the Public Issue Tracker, or to contact Cloud Support (only if you are paying for a support package) so that they can inspect your project and determine the root cause.

I'm running into this issue now too. Has anyone come up with a solution?

Building a brand-new image worked. Updating the existing one kept failing, but for some reason renaming it and pointing the Cloud Run job at it worked. Very strange.
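
In case it helps anyone, repointing the job at the renamed image looks roughly like this with the Python client (job name and image are placeholders, not a confirmed fix):

    from google.cloud import run_v2

    client = run_v2.JobsClient()

    # Placeholders -- not real names.
    job = client.get_job(name="projects/my-project/locations/europe-west1/jobs/my-job")

    # Point the job at the freshly built image under its new name, then push the
    # update and wait for it to apply.
    job.template.template.containers[0].image = "gcr.io/my-project/worker-renamed:latest"
    client.update_job(request=run_v2.UpdateJobRequest(job=job)).result()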