Hi,
I'm facing a problem where our Cloud Run services return a brief spike of 5xx responses when we deploy a new revision, along with the Cloud Run error "The request was aborted because there was no available instance." The logs also show requests waiting for the full request timeout we've configured on the service before failing.
We're using HTTP health checks on our services (both startup/readiness checks and health/liveness checks).
I'm not sure what's going on. My current suspicion is that traffic is being switched over to the new revision before it's ready.
I read this Google Blog Post that has me wondering what health checks actually affect. It's a bit ambiguous: it seems to suggest that health checks determine whether an individual container can receive traffic or needs to be restarted. Whether they're also used to decide if a revision as a whole is healthy and ready to receive traffic is unclear. The blog says:
"They [Cloud Run health checks] do not provide a "service-level" health check at the load balancer layer to indicate whether your service is overall healthy or not, instead they focus on keeping the overall quality of or your service high (lowering your potential error-rate) by ensuring containers that Cloud Run is running for you are available to actually perform the requested work."
We're using the google-github-actions/deploy-cloudrun@v2 GitHub action to deploy new revisions. First we deploy a new revision with --no-traffic, then in the next step we switch traffic over to the new revision (we do it in two steps to support canary deployments).
Thanks for reading!
These are our deployment steps:
# Deploy new revision but do NOT switch traffic over yet
- name: Deploy to Cloud Run Production
  uses: google-github-actions/deploy-cloudrun@v2
  with:
    service: <service name>
    region: <region>
    image: <image>
    tag: <new version>
    no_traffic: true

# Switch traffic over to new revision
- name: Enable traffic
  uses: google-github-actions/deploy-cloudrun@v2
  with:
    service: <service name>
    region: <region>
    tag_traffic: <new version>=<100%>

# Remove previous revision's tags
- name: Remove Old Revision Tags
  if: ${{ <new version> != <old version> }}
  run: gcloud run services update-traffic <service> --remove-tags <old version> --region <region> || true
New discovery: I used gcloud to describe the active Cloud Run revision of a service, and the status section (shown below) looks interesting.
In it, the Active condition transitioned at 17:14:35, about 51 seconds before MinInstancesProvisioned did at 17:15:26. That makes me wonder whether traffic can be routed to the new revision before it has scaled up to its minimum instances. It looks that way. Is there a way to wait for a revision to scale up?
Status conditions from gcloud run revisions describe ... --format=json:
"status": {
  "conditions": [
    {
      "lastTransitionTime": "2024-03-12T17:15:26.700598Z",
      "status": "True",
      "type": "Ready"
    },
    {
      "lastTransitionTime": "2024-03-12T17:14:35.255142Z",
      "severity": "Info",
      "status": "True",
      "type": "Active"
    },
    {
      "lastTransitionTime": "2024-03-12T17:13:24.863568Z",
      "status": "True",
      "type": "ContainerHealthy"
    },
    {
      "lastTransitionTime": "2024-03-12T17:12:50.034016Z",
      "status": "True",
      "type": "ContainerReady"
    },
    {
      "lastTransitionTime": "2024-03-12T17:15:26.638833Z",
      "status": "True",
      "type": "MinInstancesProvisioned"
    },
    {
      "lastTransitionTime": "2024-03-12T17:12:58.289282Z",
      "status": "True",
      "type": "ResourcesAvailable"
    }
  ],
  ...
}
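To make the "wait for a revision to scale up" idea concrete, here is a rough Python sketch of a gate that could run between the deploy step and the traffic switch. It polls the revision's status conditions until the ones of interest report "True". The condition names come from the output above; the helper names (unmet_conditions, describe_revision, wait_for_revision), the timeout, and the polling interval are my own hypothetical choices, and I have not confirmed that waiting on MinInstancesProvisioned actually prevents the 5xx spike.

```python
from __future__ import annotations

import json
import subprocess
import time
from typing import Callable, Iterable

# Condition types copied from the status output above. Treating these two as
# the "safe to send traffic" signal is an assumption, not documented behavior.
REQUIRED_CONDITIONS = ("Ready", "MinInstancesProvisioned")


def unmet_conditions(revision: dict,
                     required: Iterable[str] = REQUIRED_CONDITIONS) -> list[str]:
    """Return the required condition types whose status is not yet "True"."""
    conditions = revision.get("status", {}).get("conditions", [])
    status_by_type = {c.get("type"): c.get("status") for c in conditions}
    return [name for name in required if status_by_type.get(name) != "True"]


def describe_revision(revision_name: str, region: str) -> dict:
    """Fetch the revision as JSON via the gcloud CLI (installed and authenticated)."""
    result = subprocess.run(
        ["gcloud", "run", "revisions", "describe", revision_name,
         "--region", region, "--format=json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def wait_for_revision(describe: Callable[[], dict],
                      timeout_s: float = 300.0,
                      interval_s: float = 5.0,
                      required: Iterable[str] = REQUIRED_CONDITIONS) -> bool:
    """Poll until every required condition is "True"; return False on timeout."""
    required = tuple(required)
    deadline = time.monotonic() + timeout_s
    while True:
        if not unmet_conditions(describe(), required):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

With placeholder values, a call would look like wait_for_revision(lambda: describe_revision("<revision>", "<region>")), and the traffic switch would only proceed if it returns True. Passing the describe function in as a parameter keeps the polling logic testable without gcloud.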