My team works with a Cloud Run backend service that runs a gRPC server, with a separate Cloud Run service running ESPv2 as an API gateway - we followed this guide when setting everything up. This generally works well, with one intermittent but not totally rare issue - we've observed a few instances of very high latency in recent months, which lasts for a period on the order of 30 - 60 minutes.
During these issues, every request to our Cloud Endpoints service hits the configured timeout (60s) without ever seeming to proxy the request to the Cloud Run backend. I should add that it appears this issue affects a single container instance at a time - we usually have 2 instances running, so in this case there's still 1 handling requests as normal.
We've pursued cases with Google Cloud Support several times related to these issues, but gotten nowhere. And given that this service is running a Google provided image, we can't add logging or make other code changes that might give us insight into what's going on.
Any help or direction for investigation would be greatly appreciated, as while this is fairly infrequent, occurrences of this issue can have pretty major production impact for us.