I have a scraping app that uses two Cloud Run (CR) services: a stateless frontend service built in React, and a Node.js backend service that is private (accessible only by the frontend service). The backend service connects to a Cloud SQL instance via private IP and also uses a Redis instance. A static egress IP is configured for the backend service so that a third-party API provider can whitelist it.
Cloud Run backend configuration
I am seeing a number of 500 and 429 errors on the Cloud Run backend service; the frontend service has no issues. According to the CR metrics there are no cold starts (0 -> 1 scaling), and max concurrent requests is also very low (around 10).
I am looking to take my app to production, and I see that for only around 60 requests the CR instances scale to 4, so I want to understand CR's autoscaling behaviour. CPU utilisation is around 64% (as I understand it, 60% is the default out-of-the-box CPU utilisation target) and memory utilisation is around 15%. Is the autoscaling happening because CPU utilisation is above 60%, and is it possible to change this setting? The real question is why CR scales to 4-5 instances for just 60 requests when CPU and memory usage are low (I say low because CPU is only around 60% and memory around 15%). I don't see any other reason that explains the autoscaling behaviour, for example Cloud SQL not keeping up with CR autoscaling, or SIGTERM crashes that might cause CR to spin up a new instance.
Below are a few metrics
Hello,
Cloud Run indeed scales up when instance CPU utilization reaches 60%; this is not configurable. Cloud Run prioritizes getting your requests served quickly, so it scales up rather than risk not being able to serve a request. We are looking at ways to make scaling more configurable, but today this is how it works.
Do the errors go away if you raise the max instances setting?
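To illustrate why an average CPU of about 64% is consistent with scaling out, here is a rough sketch of generic target tracking. It borrows the formula documented for the Kubernetes Horizontal Pod Autoscaler purely as a stand-in; that is an assumption, since Cloud Run's actual algorithm is not public and also takes request concurrency into account. The function name and the instance counts are illustrative only.

```javascript
// Simplified target-tracking model: desired = ceil(current * observed / target).
// ASSUMPTION: this is the Kubernetes HPA formula, used here only as an
// approximation. It is NOT Cloud Run's published scaling algorithm.
function desiredInstances(currentInstances, avgCpuUtilization, targetUtilization = 0.6) {
  if (currentInstances < 1) return 1; // always keep at least one instance in this model
  return Math.ceil(currentInstances * (avgCpuUtilization / targetUtilization));
}

// Hypothetical numbers based on the 64% CPU figure quoted in the question.
console.log(desiredInstances(3, 0.64)); // 3 * (0.64 / 0.60) = 3.2  -> 4
console.log(desiredInstances(4, 0.64)); // 4 * (0.64 / 0.60) = 4.27 -> 5
```

Under this model, any average above the 60% target adds an instance, so seeing CPU hover just above 60% while the instance count sits at 4-5 is what a target-based autoscaler would be expected to produce.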
Thanks @knet. Yes, the errors go away once I increase the max instances setting, but they recur later. As I said, I am seeing 429 and 500 errors, and they are intermittent: they show up about once every 3 days, with around 15-20 occurrences of both errors in the past 2 weeks.
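For intermittent 429s like these, one common mitigation is to have the caller retry with exponential backoff and jitter while the root cause is investigated. Below is a minimal sketch in plain Node.js. The `withRetry` helper, its option names, and the defaults are all hypothetical and not part of any Cloud Run API; `doRequest` is assumed to be a caller-supplied function that returns a promise of an object with a numeric `status`, such as a `fetch` response.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: retries when the response status is 429 or 5xx.
// Option names and defaults are illustrative assumptions, not recommended values.
async function withRetry(doRequest, { maxAttempts = 4, baseDelayMs = 200, maxDelayMs = 5000 } = {}) {
  let response;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    response = await doRequest();
    const retryable = response.status === 429 || response.status >= 500;
    // Return on success, on a non-retryable status, or when attempts run out.
    if (!retryable || attempt === maxAttempts) return response;
    // Exponential backoff with full jitter: wait a random time up to the cap.
    const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
    await sleep(Math.random() * cap);
  }
  return response;
}

// Example usage (the URL is a placeholder, not a real endpoint):
// const res = await withRetry(() => fetch("https://backend.example.internal/scrape"));
```

Retrying on 500 is only safe when the request is idempotent, so for writes it may be better to retry on 429 alone.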