
Cloud Run tasks, canceled before they start

Hello,

We create a decent number of Cloud Run tasks with scheduled workflows, though I don't believe we are anywhere near the quota (topping out at about 30 running concurrently at peak during the day).

Until about 48 hours ago, this worked great. When jobs failed, it was because of errors in code, an API not responding, etc. Over the last 48-72 hours I have been seeing 5-10 tasks per day "failing" (I use the quotes because they actually have a status of "canceled"). When this happens, there are no logs for that execution when you "view logs". If I remove this:

labels."run.googleapis.com/task_index" = "0"
 
from the log filter, I actually do see one log message for the canceled task, but I am not seeing anything interesting in it. For example, here is protoPayload.response.status.conditions, with a sketch of the broadened query right after it:

 

[
  {
    "type": "Completed",
    "status": "False",
    "lastTransitionTime": "2025-02-06T12:02:01.291102Z"
  },
  {
    "type": "ResourcesAvailable",
    "status": "True",
    "lastTransitionTime": "2025-02-06T12:01:59.086478Z"
  },
  {
    "type": "Started",
    "status": "True",
    "lastTransitionTime": "2025-02-06T12:02:00.243731Z"
  },
  {
    "type": "ContainerReady",
    "status": "True",
    "lastTransitionTime": "2025-02-06T12:01:58.734889Z"
  },
  {
    "type": "Retry",
    "status": "True",
    "reason": "ImmediateRetry",
    "message": "System will retry after 0:00:00 from lastTransitionTime for attempt 0.",
    "lastTransitionTime": "2025-02-06T12:02:02.847796Z",
    "severity": "Info"
  }
]
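
For anyone who wants to reproduce that lookup outside the Logs Explorer, here is a minimal google-cloud-logging sketch of the broadened query. The job and execution names are placeholders taken from the status message further down, and it relies on the run.googleapis.com/execution_name label that Cloud Run adds to job logs alongside the task_index label above:

from google.cloud import logging

# Placeholders taken from the status message further down in this post;
# substitute your own job and execution names.
JOB_NAME = "archx-task-medium"
EXECUTION_NAME = "archx-task-medium-tb9n6"

log_filter = (
    'resource.type = "cloud_run_job" '
    f'AND resource.labels.job_name = "{JOB_NAME}" '
    f'AND labels."run.googleapis.com/execution_name" = "{EXECUTION_NAME}"'
)

client = logging.Client()  # uses Application Default Credentials
for entry in client.list_entries(
    filter_=log_filter, order_by=logging.DESCENDING, max_results=50
):
    print(entry.timestamp, entry.payload)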

 

As I said, this service has been very reliable until recently. It certainly could be some mistake we are making, but we haven't made any changes to our infrastructure, and I'm at a loss. Any tips about where I should look to troubleshoot this would be appreciated.

The same workflow immediately creates other tasks that have no issues. I cannot see a pattern as to why a small subset has this problem.
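
In case a side-by-side view helps anyone spot something I'm missing, here is a rough sketch of listing the job's recent executions with the Cloud Run v2 Python client; the project, region, and job name are placeholders, and the count fields are my reading of the v2 Execution resource:

from google.cloud import run_v2

# Placeholders: substitute your own project, region, and job name.
PROJECT = "my-project"
REGION = "us-central1"
JOB = "archx-task-medium"

client = run_v2.ExecutionsClient()
parent = f"projects/{PROJECT}/locations/{REGION}/jobs/{JOB}"

# One line per recent execution, so the canceled ones stand out.
for execution in client.list_executions(parent=parent):
    print(
        execution.name.split("/")[-1],
        "succeeded:", execution.succeeded_count,
        "failed:", execution.failed_count,
        "cancelled:", execution.cancelled_count,
    )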

This is the value of protoPayload.status in the one log entry I can find:

{
  "code": 2,
  "message": "Execution archx-task-medium-tb9n6 has failed to complete, 0/1 tasks were a success."
}
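
The same execution can also be pulled back through the Jobs API to double-check the counts and conditions. A minimal sketch with the Cloud Run v2 Python client, with project and region as placeholders:

from google.cloud import run_v2

# Placeholders: substitute your own project and region; the job and
# execution names come from the status message above.
PROJECT = "my-project"
REGION = "us-central1"
JOB = "archx-task-medium"
EXECUTION = "archx-task-medium-tb9n6"

client = run_v2.ExecutionsClient()
name = client.execution_path(PROJECT, REGION, JOB, EXECUTION)
execution = client.get_execution(name=name)

# Task counts give a quick read on whether anything actually ran.
print("succeeded:", execution.succeeded_count)
print("failed:", execution.failed_count)
print("cancelled:", execution.cancelled_count)

# These mirror the protoPayload.response.status.conditions shown earlier.
for condition in execution.conditions:
    print(condition)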



5 REPLIES

Thank you for posting here, especially with the details and timeline! The engineering team is looking to see if this is an issue on our side.

 

Hi @knet 

Any update on this issue? We ran into the same issue 3 weeks ago but couldn't find any cause on our side. It looks more like an infrastructure migration on the Cloud Run side.

Hi, I just came across this topic again. I did see similar issues reported to our eng team through other channels. Please let me know if you're still seeing this. 

It continued for a few days after my original post, and we haven't seen the issue for ~4-5 weeks, which lines up with what @david74chou posted above.

That's great to hear, thanks for confirming!