Hello,
I'm trying to deploy a simple SLO for my Cloud Run service. This is what I have at the moment. It's written in Terraform, but I also tried creating the SLO in the console and got the same results, so it's not a Terraform provider problem:
resource "google_monitoring_slo" "request_based_slo" {
  for_each = toset(var.services_w_slos)

  service      = each.value
  slo_id       = format("requests-slo-%s", each.value)
  display_name = format("Request-based SLO for %s", each.value)

  goal                = 0.999
  rolling_period_days = 28

  request_based_sli {
    good_total_ratio {
      good_service_filter = join(" AND ", [
        "metric.label.\"response_code_class\"=\"2xx\"",
        "metric.type=\"run.googleapis.com/request_count\"",
        "resource.type=\"cloud_run_revision\"",
        format("resource.label.\"project_id\"=\"%s\"", var.project_id),
      ])

      total_service_filter = join(" AND ", [
        "metric.type=\"run.googleapis.com/request_count\"",
        "resource.type=\"cloud_run_revision\"",
        format("resource.label.\"project_id\"=\"%s\"", var.project_id),
      ])
    }
  }
}
This SLO is really simple: over a rolling period of 28 days, divide the number of good requests (2xx status code) by the total number of requests. Nothing more, nothing less.
But it doesn't do what I expected (see image):
I sent a few good requests. The SLI was at 100%. Then I sent a bad request (on purpose). The SLI dropped to 80%. I then sent a couple of good requests. The SLI jumped back to 100%.
It looks like the SLI measures the "instantaneous" reliability, without taking the previous requests into account. I suspect the metrics I used aren't really time series. Looking at the run.googleapis.com/request_count metric, I see that it's sampled every 60s. So what I'm getting is basically a continuous measure of the reliability of my service in 60s chunks. This is not quite an SLO.
I tried everything I could think of and read all the docs I could find, but I couldn't get my SLO to work. What I would like is (IMO) very standard: the ratio of good requests over the last 28 days to total requests over the last 28 days.
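To make the difference concrete, here is a minimal Python sketch of the two calculations. The per-sample counts are made up for illustration (they are not taken from my service, and the function names are hypothetical); they are only chosen so that the middle sample reproduces the 80% dip described above. The first function is what the chart appears to show, the second is what I expected the SLO to report.

```python
# Hypothetical (good, total) request counts, one pair per 60s sample.
# These numbers are illustrative assumptions, not real data.
samples = [(3, 3), (4, 5), (2, 2)]


def per_sample_ratios(samples):
    """Ratio of good to total requests for each sample on its own."""
    return [good / total for good, total in samples if total > 0]


def windowed_ratio(samples):
    """Ratio of all good requests to all requests across the whole window."""
    good = sum(g for g, _ in samples)
    total = sum(t for _, t in samples)
    return good / total if total else None


print(per_sample_ratios(samples))  # [1.0, 0.8, 1.0]
print(windowed_ratio(samples))     # 0.9
```

With these numbers, the per-sample view goes back to 100% as soon as a sample contains only good requests, while the windowed ratio stays at 90% because the bad request still counts against the total.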
Am I missing something? Could you give me a hand please?