Hi, I have created a Cloud Run job where each task takes a certain number of rows from a BigQuery table, passes them through some APIs, and updates the API response for the corresponding records in the table.
But sometimes the job retries even though there is no error in it, and the execution starts from the beginning for the retried task.
What could be the reason here?
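For context, each task does roughly the following; the project, dataset, table, and column names below are placeholders, and the sharding by task index is only a sketch of the approach, not the exact code:

```python
import os
from google.cloud import bigquery

# Cloud Run jobs inject these, so each task can work on its own slice of rows.
TASK_INDEX = int(os.environ.get("CLOUD_RUN_TASK_INDEX", 0))
TASK_COUNT = int(os.environ.get("CLOUD_RUN_TASK_COUNT", 1))

client = bigquery.Client()


def call_api_and_update(row) -> None:
    """Placeholder: pass the row through the API and UPDATE its api_response column."""


# Placeholder query: take the batch of unprocessed rows assigned to this task.
rows = client.query(
    f"""
    SELECT id, payload
    FROM `my_project.my_dataset.my_table`
    WHERE api_response IS NULL
      AND MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), {TASK_COUNT}) = {TASK_INDEX}
    LIMIT 150
    """
).result()

for row in rows:
    call_api_and_update(row)
```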
So, a Cloud Run job has a task timeout and a number of retries. If the execution doesn't complete within the stipulated time, the task is retried until the retry count is exhausted.
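For reference, the values currently set on your job can also be checked programmatically; a minimal sketch with the run_v2 client, where the project, region, and job name are placeholders:

```python
from google.cloud import run_v2

client = run_v2.JobsClient()
job = client.get_job(
    name="projects/my-project/locations/us-east1/jobs/my-job"  # placeholder resource name
)

# Per-task settings that govern when Cloud Run retries a task.
print("task timeout:", job.template.template.timeout)
print("max retries :", job.template.template.max_retries)
print("parallelism :", job.template.parallelism)
```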
Thank you.
I have set a task timeout of 2 hours.
But sometimes the task retries after running for only a few minutes, even though no error is generated in the job.
What could be the reason for the job retrying without even generating any error?
Please let me know.
Can you check whether the job is exiting before it retries? Also, can you attach screenshots of the logs from before and after the retry?
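One thing that makes silent retries visible is logging the attempt counter Cloud Run injects into every task; a small sketch (the log format itself is just an example):

```python
import logging
import os

logging.basicConfig(level=logging.INFO)

# Cloud Run jobs set these for every task; the attempt counter increments on each retry.
task_index = os.environ.get("CLOUD_RUN_TASK_INDEX", "0")
task_attempt = os.environ.get("CLOUD_RUN_TASK_ATTEMPT", "0")

logging.info("starting task %s, attempt %s", task_index, task_attempt)
```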
The job runs fine, then gets stuck in the middle, and after a few seconds it retries and starts running from the beginning again.
And there is no error log generated here, which makes the issue difficult to debug.
Hello @chethan108225 ,
Might it be related to insufficient resources? I mean, maybe the service (or, more accurately, the container) is getting OOM-killed due to the BigQuery query complexity?
I'm guessing at this point, as it's hard to say from your description. Are you able to execute the job (or run the Cloud Run service), take a screenshot of the standard metrics like CPU, memory, latency, etc., and paste that info here?
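If it does turn out to be memory, one quick way to see it without extra tooling is to log the process's peak RSS from inside the task; a rough sketch (Linux-specific; ru_maxrss is reported in KiB there):

```python
import logging
import resource


def log_peak_memory(label: str) -> None:
    # ru_maxrss is the peak resident set size; on Linux it is reported in KiB.
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    logging.info("%s: peak RSS %.1f MiB", label, peak_kib / 1024)


# Example: call this before the BigQuery query and after each processed batch.
log_peak_memory("after batch")
```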
PS: Maybe we could utilize Trace and check latencies and parameters from that?
https://cloud.google.com/trace/docs/trace-app-latency
--
cheers,
DamianS
Hello @DamianS
Thanks for your response.
Let me give you more details on this.
Basically, each job picks up 150 records every 20 minutes from a BigQuery table, passes each of those records through 3 APIs (outside of GCP) one after the other, and updates each API's response to the corresponding row in the BigQuery table.
The Cloud Run job I created has 4 CPUs and 16 GiB of memory allocated, with a parallelism factor of 5.
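For context, a simplified sketch of the per-record flow; the endpoints, field names, and the explicit request timeouts below are placeholders rather than the real values:

```python
import requests
from google.cloud import bigquery

# Placeholder URLs for the three external APIs called one after the other.
API_ENDPOINTS = [
    "https://api-one.example.com/process",
    "https://api-two.example.com/enrich",
    "https://api-three.example.com/score",
]


def process_record(client: bigquery.Client, record_id: str, payload: dict) -> None:
    result = payload
    for url in API_ENDPOINTS:
        # Explicit connect/read timeouts keep one hung external call from stalling the task.
        response = requests.post(url, json=result, timeout=(10, 120))
        response.raise_for_status()
        result = response.json()

    # Write the final response back to the corresponding row (placeholder table/columns).
    client.query(
        "UPDATE `my_project.my_dataset.my_table` SET api_response = @resp WHERE id = @id",
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("resp", "STRING", str(result)),
                bigquery.ScalarQueryParameter("id", "STRING", record_id),
            ]
        ),
    ).result()
```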
A few questions:
- Does this happen fairly predictably (i.e. most of the time)? Or does it happen sometimes, maybe in clusters (some days the job runs fine, other days it has this strange retry behavior)?
- Is there really nothing in the logs?
- What is the memory usage like - are you perhaps running out of memory?
Could you also tell us which region you're running in?
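On the logs question: terminations initiated by the platform (including out-of-memory kills) often show up as warning-severity entries on the job rather than as application errors, so it's worth pulling those explicitly; a sketch where the job name is a placeholder:

```python
from google.cloud import logging

client = logging.Client()
log_filter = (
    'resource.type="cloud_run_job" '
    'resource.labels.job_name="my-job" '  # placeholder job name
    'severity>=WARNING'
)

# Most recent warnings/errors first; platform-level terminations are typically
# logged here even when the application itself emits nothing.
for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    print(entry.timestamp, entry.severity, entry.payload)
```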
Hi, we'd like to take a closer look at what's going on; if you send me a private message with your project ID I'll pass it to our engineering team to take a look.
Thanks,
Karolina
(Cloud Run jobs product manager)
Hi @knet,
Appreciate your response on this.
1. It happens at least 5-6 times on some days; on other days it just runs fine, and there are no error logs generated when it retries.
2. I'm not sure whether this is because of running out of memory.
3. The job is running in the region us-east1.
Hey, I'm facing the same issue. Did you find any solution to this?