Solved: Cloud Run Jobs stuck at Pending with no logs

Jlguti

Hello,

I have around 6-7 Cloud Run Jobs, all configured with exactly the same settings/resources and all in the us-east1 region.

Everything had been working just fine until today. Now some of the jobs seem to be stuck in the Pending status and never produce any logs for the execution. This is the most information I have been able to get, and it comes from the YAML file:

status:
  observedGeneration: 1
  conditions:
  - type: Completed
    status: 'False'
    message: Resource readiness deadline exceeded.
    lastTransitionTime: '2024-04-19T22:28:30.330446Z'
  - type: Started
    status: Unknown
    message: Deadline exceeded. Container image import still in progress.
    lastTransitionTime: '2024-04-19T22:28:30.330446Z'
  - type: Retry
    status: 'True'
    reason: WaitingForOperation
    message: System will retry after 1:00:00 from lastTransitionTime for polling interval
    lastTransitionTime: '2024-04-19T22:33:31.183686Z'
    severity: Info
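Reading these conditions: Completed is 'False' and Started is 'Unknown', and the Started message says the container image import never finished. That suggests the container likely never started, which would be consistent with there being no execution logs. A minimal sketch of pulling that out of the conditions list, assuming the YAML has already been parsed into plain Python dicts; `summarize_conditions` is a hypothetical helper, not part of any Google Cloud SDK.

```python
# Hypothetical helper (not part of any Google Cloud SDK): summarize the status
# conditions of a stuck Cloud Run job execution. Assumes the YAML above has
# already been parsed into plain Python dicts.

def summarize_conditions(conditions):
    """Return (non_true, image_import_timed_out).

    non_true: list of (type, status, message) for every condition whose
    status is not 'True'.
    image_import_timed_out: True if any condition message reports that the
    container image import was still in progress.
    """
    non_true = []
    image_import_timed_out = False
    for cond in conditions:
        message = cond.get("message", "")
        if cond.get("status") != "True":
            non_true.append((cond.get("type"), cond.get("status"), message))
        if "image import still in progress" in message.lower():
            image_import_timed_out = True
    return non_true, image_import_timed_out


# The three conditions from the post, transcribed by hand.
conditions = [
    {"type": "Completed", "status": "False",
     "message": "Resource readiness deadline exceeded."},
    {"type": "Started", "status": "Unknown",
     "message": "Deadline exceeded. Container image import still in progress."},
    {"type": "Retry", "status": "True", "reason": "WaitingForOperation",
     "message": "System will retry after 1:00:00 from lastTransitionTime for polling interval"},
]

non_true, timed_out = summarize_conditions(conditions)
for ctype, status, message in non_true:
    print(f"{ctype}={status}: {message}")
print("image import timed out:", timed_out)
```

The Retry condition is skipped on purpose: its status is 'True' and it only reports the polling schedule.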

I have already tried the following:

Made sure the service account is correct and has the required roles for the service.
Deleted the Docker image from Artifact Registry and uploaded a new one.
Deleted the "old" job and created a new one with the new image.

I haven't had any success yet with the jobs that are stuck at Pending. However, some of the other jobs (with exactly the same configuration) are working just fine.

Any idea why this problem is happening, or how to fix it?

Thanks in advance for the help.


mgdurrant

I am having a similar problem with Batch. A job submission script I've used in the past is not working anymore for some reason. It just says "VM provisional model: Pending", fails to spin up any VMs, and the status stays "Queued."

EthanLin

I have the same problem with Cloud Run. After I pushed a new image (identical to the previous one) and Cloud Run started using it, it showed the same error messages as above.

alfredomagallon

Same here. We opened a ticket, but sadly no news yet!

papanouel

Hi everyone,

Same here. My workflow had been executing a Cloud Run job without problems since January. It suddenly stopped working yesterday.

After 10 minutes of execution, my workflow returns the following messages:

"Resource readiness deadline exceeded."

"Deadline exceeded. Container image import still in progress."

"System will retry after 0:01:00 from lastTransitionTime for polling interval"
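The retry messages in this thread differ: the YAML in the first post says 1:00:00, while this one says 0:01:00. Assuming the interval is H:MM:SS measured from the condition's lastTransitionTime (the message wording suggests this, but it is an assumption), that is one hour versus one minute. A small sketch that parses the interval and computes the next retry time; both function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumption: the interval in "System will retry after H:MM:SS ..." is
# hours:minutes:seconds, measured from the condition's lastTransitionTime.
_RETRY_PATTERN = re.compile(r"retry after (\d+):(\d{2}):(\d{2})")


def parse_retry_interval(message):
    """Return the retry interval as a timedelta, or None if absent."""
    match = _RETRY_PATTERN.search(message)
    if match is None:
        return None
    hours, minutes, seconds = (int(g) for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def next_retry_time(last_transition_time, message):
    """Add the parsed interval to an RFC 3339 timestamp ending in 'Z'."""
    interval = parse_retry_interval(message)
    if interval is None:
        return None
    # datetime.fromisoformat() accepts a trailing 'Z' only on Python 3.11+,
    # so rewrite it as an explicit UTC offset first.
    start = datetime.fromisoformat(last_transition_time.replace("Z", "+00:00"))
    return start + interval


# Timestamp and message taken from the Retry condition in the first post.
print(next_retry_time(
    "2024-04-19T22:33:31.183686Z",
    "System will retry after 1:00:00 from lastTransitionTime for polling interval",
))
```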

No modifications whatsoever have been made. The Cloud Run job is not even being executed. Although nothing has been modified since last January, I double-checked the image used by Cloud Run: still the same. I also tried allocating more memory to the job's execution environment (you never know...), but nothing...

That is frustrating, as I don't believe there is anything else I can do at this stage.

It doesn't seem like there's any server issue on GCP 😕

papanouel

Well... happy I didn't give up... It does work now.

My Cloud Run jobs were created to run in the us-central1 region. I duplicated the jobs with exactly the same settings except for the region, which I set to asia-northern1. After making the change, I updated my workflow to launch the new job in the new region (otherwise the workflow won't find the job, of course), and voilà! My workflow and Cloud Run job are working as they used to. The execution hasn't completed yet (it is a long execution, about an hour over several runs), but as soon as the workflow executes a job, the status is Running instead of Pending.

I'm not sure whether it is a Cloud Run regional server issue, but if so, it isn't mentioned anywhere... When I have more time, I'll investigate this further. I don't want to face a similar issue without knowing exactly why I suddenly need to change the region...

Hope that helps
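Duplicating the job in another region changes how it is addressed, which is why the workflow had to be updated: a job is identified by project, region, and name together. A minimal sketch, assuming the Cloud Run Admin API v2 resource-name format `projects/{project}/locations/{region}/jobs/{job}` (the form a workflow step calling that API would pass as the job name). `job_resource_name`, "my-project", "my-job", and the second region are placeholders for illustration.

```python
import re

# Assumption: the Cloud Run Admin API v2 identifies a job as
#   projects/{project}/locations/{region}/jobs/{job}
# The region pattern below checks the *shape* of a region name only
# (e.g. "us-central1"); it does not verify that the region exists.
_REGION_SHAPE = re.compile(r"^[a-z]+-[a-z]+[0-9]+$")


def job_resource_name(project, region, job):
    """Build the fully qualified job name for a given region."""
    if not _REGION_SHAPE.match(region):
        raise ValueError(f"unexpected region format: {region!r}")
    return f"projects/{project}/locations/{region}/jobs/{job}"


# "my-project" and "my-job" are placeholder names.
old_name = job_resource_name("my-project", "us-central1", "my-job")
new_name = job_resource_name("my-project", "europe-west1", "my-job")
print(old_name)
print(new_name)
```

Keeping the region as a single parameter means a move like the one described above touches one value in the caller instead of a hard-coded name.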

Jlguti

Yeah, it seems that for future issues with these services (Batch or Cloud Run Jobs), the workaround is to create the same resource in another region.

I was having issues with the US-EAST1 instance, replicated it in SA-WEST, and it worked perfectly.

Hopefully Google becomes more aware of these kinds of issues, as this had been happening since Thursday, at least for me.

alfredomagallon

Thanks @papanouel

Maybe the issue is only affecting some regions, so moving the job is a workaround, not a solution.

EthanLin

I discovered that my image is stored in Artifact Registry in asia-east1, while Cloud Run is located in us-central1. When I moved the image to us-central1, everything worked well.

But the question is: if I use the previous image stored in asia-east1 (pushed on 2024-03-31), Cloud Run does not show any error. Only newly pushed images seem to require being in the same region as Cloud Run.
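A quick way to spot this kind of mismatch is to read the location out of the image URI, since Artifact Registry Docker images are addressed as `LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE`. A minimal sketch with hypothetical helper names; note the thread does not establish that cross-region pulls are unsupported (the older asia-east1 image kept working), so a mismatch is a clue, not a diagnosis.

```python
import re

# Artifact Registry Docker images are addressed as
#   LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE[:TAG]
# This helper only inspects that hostname; other hosts (e.g. gcr.io) return None.
_AR_HOST = re.compile(r"^([a-z0-9-]+)-docker\.pkg\.dev/")


def registry_location(image_uri):
    """Return the Artifact Registry location in the image URI, or None."""
    match = _AR_HOST.match(image_uri)
    return match.group(1) if match else None


def same_location(image_uri, run_region):
    """True if the image's registry location equals the Cloud Run region.

    Multi-region repositories (e.g. "us") never compare equal to a single
    region such as "us-central1".
    """
    return registry_location(image_uri) == run_region


# Placeholder project/repository/image names.
image = "asia-east1-docker.pkg.dev/my-project/my-repo/my-image:latest"
print(registry_location(image))             # asia-east1
print(same_location(image, "us-central1"))  # False
```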

papanouel

Thanks @EthanLin

Interesting. I checked on my side as well, and my Artifact Registry image is also in the asia-northern1 region. It had been working well so far with the image and Cloud Run in different regions, but I guess not anymore.

BTW, my image was last uploaded back in January, and it has not been working since Monday. To be more specific, the overall process only runs on Mondays and Wednesdays. If I had tried last Thursday, maybe it would not have worked either.

Anyway, happy to hear this workaround (which may not be one?) is working for everyone. Hopefully we’ll have some clarification from Google on this.

alfredomagallon

The issue has now been resolved.