
Long response times when running a Cloud Run Job through API

Hello,

I have this Python code that runs a Cloud Run Job using the google-cloud-run Python SDK:


    from google.cloud import run_v2

    # Build the request, overriding the container arguments for this execution.
    client = run_v2.JobsClient()
    job_request = run_v2.RunJobRequest(dict(
        name="projects/my-project/locations/my-location/jobs/my-job",
        overrides=dict(
            container_overrides=[dict(
                args=["some", "command", "line", "arguments"]
            )]
        )
    ))
    # Trigger a new execution of the job; returns a long-running operation.
    operation = client.run_job(request=job_request)


The code usually works well and the call to client.run_job() returns successfully in under 3 seconds. Occasionally, and for several hours at a time, the same call consistently takes 87 seconds to return (while still successfully running the job).

I've enabled logs for gRPC and they look like this:

D0111 17:55:06.794000000 22720 src/core/ext/filters/client_channel/resolver/dns/dns_resolver_plugin.cc:57] Using native dns resolver
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "rls_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "xds_cluster_manager_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "xds_cluster_impl_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "cds_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "xds_cluster_resolver_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "xds_override_host_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "xds_wrr_locality_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/load_balancing/lb_policy_registry.cc:46] registering LB policy factory for "ring_hash_experimental"
D0111 17:55:06.794000000 22720 src/core/lib/security/certificate_provider/certificate_provider_registry.cc:33] registering certificate provider factory for "file_watcher"

I0111 17:56:32.489000000 22720 src/core/lib/event_engine/thread_pool/work_stealing_thread_pool.cc:216] WorkStealingThreadPoolImpl::Quiesce
D0111 17:56:32.489000000 22720 src/core/lib/event_engine/thread_pool/work_stealing_thread_pool.cc:327] Waiting for lifeguard thread to shut down
D0111 17:56:32.489000000 22720 src/core/lib/surface/init.cc:159] grpc_shutdown starts clean-up now

A few hours later, when response times are back to normal, the gap in the logs is gone.

Is this an issue with the Cloud Run API service or with the OAuth services? How can I get this issue resolved?

Kind regards,
urig


5 REPLIES

Have you configured Cloud Trace (Instrument for Cloud Trace | Google Cloud) to narrow down whether it is an authentication issue or an API response issue? I'd run a number of tests in the console and then look at the trace dashboard. If that doesn't reveal anything, I would go to the REST page and submit requests to the service to force the trace behavior.

Hi. Thanks for suggesting Cloud Trace. I've already pinpointed the point in my code where the delay happens: it's in the call to client.run_job(). I need to figure out why the interaction between the client SDK and the cloud services is taking so long. Do you mean there's a way to make Cloud Trace show me more insightful logs/traces of that interaction? I couldn't find references to that in the link.
Thanks, urig

You incorporate it within the code - the page has the links for OpenTelemetry. Since you state you're using Python (Python | OpenTelemetry), I would add the packages (for example, Auto-Instrumentation Example | OpenTelemetry), add a trace call, and wrap the method call, as in the sketch below.
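A minimal sketch of what that wrapping might look like, assuming the opentelemetry-sdk and opentelemetry-exporter-gcp-trace packages are installed; the tracer name and span name below are illustrative, not part of the original reply:

    from google.cloud import run_v2
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter

    # Export spans to Cloud Trace so the trace dashboard can show where the time is spent.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(CloudTraceSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("run-job-timing")  # illustrative tracer name

    client = run_v2.JobsClient()
    job_request = run_v2.RunJobRequest(dict(
        name="projects/my-project/locations/my-location/jobs/my-job",
    ))

    # Wrap the slow call in a span so its duration appears as a trace.
    with tracer.start_as_current_span("run_job"):
        operation = client.run_job(request=job_request)

The span then shows whether the 87 seconds are spent inside the run_job call itself or somewhere else in the surrounding code.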

Thanks.

I enabled gRPC tracing and logging via environment variables, and I see a long loop of log messages that repeat themselves. Might anyone know what they mean, and why they repeat endlessly until a timeout at around 5 minutes?
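For reference, a hedged sketch of how gRPC debug output like the logs below is typically switched on. GRPC_VERBOSITY and GRPC_TRACE are standard gRPC environment variables, but the exact trace categories used here are an assumption, since the original setting was omitted:

    import os

    # Must be set before grpc is imported (directly or via google.cloud.run_v2),
    # because the gRPC core reads these environment variables when it loads.
    os.environ["GRPC_VERBOSITY"] = "DEBUG"  # standard gRPC setting
    os.environ["GRPC_TRACE"] = "all"        # assumed category; narrower tracers can also be used

    from google.cloud import run_v2  # import only after setting the env vars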

Here's a small excerpt of the repeating log messages:

I0117 19:43:27.062342247 32 timer_manager.cc:201] sleep until kicked
D0117 19:43:27.062335862 32 timer_generic.cc:722] TIMER CHECK END: r=1; next=552366
D0117 19:43:27.062327485 32 timer_generic.cc:696] TIMER CHECK BEGIN: now=552067 next=9223372036854775807 tls_min=552067 glob_min=552366
D0117 19:43:27.062314648 32 timer_generic.cc:377] .. add to shard 1 with queue_deadline_cap=552365 => is_first_timer=false
D0117 19:43:27.062305246 32 timer_generic.cc:341] TIMER 0x2a0cf4a4afc0: SET 557067 now 552067 call 0x2a0cf4a4aff8[0x3ebff10e1880]
I0117 19:43:27.062284529 25 timer_manager.cc:191] sleep for a 299 milliseconds
D0117 19:43:27.062276520 25 timer_generic.cc:722] TIMER CHECK END: r=1; next=552366
D0117 19:43:27.062266008 25 timer_generic.cc:696] TIMER CHECK BEGIN: now=552067 next=9223372036854775807 tls_min=547290 glob_min=552366
I0117 19:43:27.062250446 25 timer_manager.cc:207] wait ended: was_timed:0 kicked:0
I0117 19:43:27.062220850 32 timer_manager.cc:135] flush exec_ctx
I0117 19:43:27.062206237 32 timer_manager.cc:127] kick untimed waiter
D0117 19:43:27.062198946 32 timer_generic.cc:722] TIMER CHECK END: r=2; next=552366
D0117 19:43:27.062189429 32 timer_generic.cc:615] .. result --> 2, shard[1]->min_deadline 552067 --> 552366, now=552067
D0117 19:43:27.062181873 32 timer_generic.cc:556] .. shard[1] popped 1
D0117 19:43:27.062173954 32 timer_generic.cc:512] .. shard[1]: heap_empty=true
D0117 19:43:27.062165017 32 timer_generic.cc:532] TIMER 0x2a0cf4a4afc0: FIRE 0ms late
D0117 19:43:27.062156414 32 timer_generic.cc:525] .. check top timer deadline=552067 now=552067
D0117 19:43:27.062147640 32 timer_generic.cc:512] .. shard[1]: heap_empty=false
D0117 19:43:27.062139254 32 timer_generic.cc:596] .. shard[1]->min_deadline = 552067
D0117 19:43:27.062125251 32 timer_generic.cc:696] TIMER CHECK BEGIN: now=552067 next=9223372036854775807 tls_min=551969 glob_min=552067
I0117 19:43:27.062050241 32 timer_manager.cc:207] wait ended: was_timed:1 kicked:0
I0117 19:43:26.980424529 19 completion_queue.cc:957] grpc_completion_queue_next(cq=0x2a0cf4a4f3e0, deadline=gpr_timespec { tv_sec: 1705520607, tv_nsec: 180419823, clock_type: 1 }, reserved=(nil))
I0117 19:43:26.980391399 19 completion_queue.cc:1069] RETURN_EVENT[0x2a0cf4a4f3e0]: QUEUE_TIMEOUT
I0117 19:43:26.964180331 32 timer_manager.cc:191] sleep for a 98 milliseconds
D0117 19:43:26.964173967 32 timer_generic.cc:722] TIMER CHECK END: r=1; next=552067
D0117 19:43:26.964163479 32 timer_generic.cc:615] .. result --> 1, shard[0]->min_deadline 551969 --> 552969, now=551969
D0117 19:43:26.964155323 32 timer_generic.cc:556] .. shard[0] popped 0
D0117 19:43:26.964148894 32 timer_generic.cc:483] .. shard[0]->queue_deadline_cap --> 552968
D0117 19:43:26.964141911 32 timer_generic.cc:512] .. shard[0]: heap_empty=true
D0117 19:43:26.964133619 32 timer_generic.cc:596] .. shard[0]->min_deadline = 551969
D0117 19:43:26.964119859 32 timer_generic.cc:696] TIMER CHECK BEGIN: now=551969 next=9223372036854775807 tls_min=551786 glob_min=551969
I0117 19:43:26.964065863 32 timer_manager.cc:207] wait ended: was_timed:1 kicked:0
I0117 19:43:26.781345196 32 timer_manager.cc:191] sleep for a 183 milliseconds
D0117 19:43:26.781337932 32 timer_generic.cc:722] TIMER CHECK END: r=1; next=551969
D0117 19:43:26.781329961 32 timer_generic.cc:615] .. result --> 1, shard[2]->min_deadline 551786 --> 552787, now=551786
D0117 19:43:26.781322096 32 timer_generic.cc:556] .. shard[2] popped 0
D0117 19:43:26.781311760 32 timer_generic.cc:483] .. shard[2]->queue_deadline_cap --> 552786
D0117 19:43:26.781305549 32 timer_generic.cc:512] .. shard[2]: heap_empty=true
D0117 19:43:26.781298544 32 timer_generic.cc:596] .. shard[2]->min_deadline = 551786
D0117 19:43:26.781283342 32 timer_generic.cc:696] TIMER CHECK BEGIN: now=551786 next=9223372036854775807 tls_min=551365 glob_min=551786
I0117 19:43:26.781253655 32 timer_manager.cc:207] wait ended: was_timed:1 kicked:0
I0117 19:43:26.779145079 19 completion_queue.cc:957] grpc_completion_queue_next(cq=0x2a0cf4a4f3e0, deadline=gpr_timespec { tv_sec: 1705520606, tv_nsec: 979139271, clock_type: 1 }, reserved=(nil))
I0117 19:43:26.779073654 19 completion_queue.cc:1069] RETURN_EVENT[0x2a0cf4a4f3e0]: QUEUE_TIMEOUT
I0117 19:43:26.577235512 19 completion_queue.cc:957] grpc_completion_queue_next(cq=0x2a0cf4a4f3e0, deadline=gpr_timespec { tv_sec: 1705520606, tv_nsec: 777229663, clock_type: 1 }, reserved=(nil))
I0117 19:43:26.577180721 19 completion_queue.cc:1069] RETURN_EVENT[0x2a0cf4a4f3e0]: QUEUE_TIMEOUT
I0117 19:43:26.375641303 19 completion_queue.cc:957] grpc_completion_queue_next(cq=0x2a0cf4a4f3e0, deadline=gpr_timespec { tv_sec: 1705520606, tv_nsec: 575634744, clock_type: 1 }, reserved=(nil))
I0117 19:43:26.375596778 19 completion_queue.cc:1069] RETURN_EVENT[0x2a0cf4a4f3e0]: QUEUE_TIMEOUT
I0117 19:43:26.360353347 32 timer_manager.cc:191] sleep for a 421 milliseconds
D0117 19:43:26.360343908 32 timer_generic.cc:722] TIMER CHECK END: r=1; next=551786


Hmmm... that is a gRPC timeout. Wrap your method call in a try/except to capture the error, for example as sketched below.
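A minimal sketch of that, assuming the errors surface as google.api_core exceptions; the exception types caught and the 30-second timeout below are assumptions for illustration:

    from google.api_core import exceptions as gapi_exceptions
    from google.cloud import run_v2

    client = run_v2.JobsClient()
    job_request = run_v2.RunJobRequest(dict(
        name="projects/my-project/locations/my-location/jobs/my-job",
    ))

    try:
        # Fail fast instead of waiting minutes for the default deadline.
        operation = client.run_job(request=job_request, timeout=30)
    except gapi_exceptions.DeadlineExceeded as exc:
        # The underlying gRPC call hit its deadline; log it for diagnosis.
        print(f"run_job timed out: {exc}")
    except gapi_exceptions.GoogleAPICallError as exc:
        # Any other API-level error (auth, service unavailable, etc.).
        print(f"run_job failed: {exc}")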

You can find similar issues describing these events in the grpc/grpc repo (I'm pointing to a known issue only as an example, not because it matches your language or your exact problem): Problem with keep-alive and streaming connection using C++ · Issue #15260 · grpc/grpc (github.com)
