Impossible to setup proper request-based SLO for cloud run service

JPFrancoia · 09-21-2023 09:53 AM

Hello,

I'm trying to deploy a simple SLO for my Cloud Run service. This is what I have at the moment, in Terraform. I also tried creating it in the console, with the same results, so it's not a Terraform provider problem:

resource "google_monitoring_slo" "request_based_slo" {
  for_each = toset(var.services_w_slos)

  service      = each.value
  slo_id       = format("requests-slo-%s", each.value)
  display_name = format("Request-based SLO for %s", each.value)

  goal                = 0.999
  rolling_period_days = 28

  request_based_sli {
    good_total_ratio {
      good_service_filter = join(" AND ", [
        "metric.label.\"response_code_class\"=\"2xx\"",
        "metric.type=\"run.googleapis.com/request_count\"",
        "resource.type=\"cloud_run_revision\"",
        format("resource.label.\"project_id\"=\"%s\"", var.project_id),
      ])
      total_service_filter = join(" AND ", [
        "metric.type=\"run.googleapis.com/request_count\"",
        "resource.type=\"cloud_run_revision\"",
        format("resource.label.\"project_id\"=\"%s\"", var.project_id),
      ])
    }
  }
}

This SLO is really simple: over a rolling period of 28 days, take the number of good requests (2xx status code) over the total number of requests. Nothing more, nothing less. But it doesn't do what I expected (see image):

I sent a few good requests. The SLI was at 100%. Then I sent a bad request (on purpose). The SLI dropped to 80%. I then sent a couple of good requests. The SLI jumped back to 100%.

It looks like the SLI measures the "instantaneous" reliability, without taking previous requests into account. I suspect the metrics I used aren't really time series. Looking at run.googleapis.com/request_count, I see that it's sampled every 60s. So basically what I'm getting is a continuous measure of the reliability of my service, in 60s chunks. This is not quite an SLO.

I tried everything and read all the docs I could find, but I couldn't get my SLO to work. What I would like is (IMO) very standard: the ratio of good requests over the last 28 days to total requests over the last 28 days.
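To make the difference concrete, here is a rough Python sketch of the two calculations. The per-window counts are hypothetical numbers I picked to match the 100% / 80% / 100% pattern described above (I'm assuming the 80% came from 4 good requests and 1 bad one in the same sample). The `Window` class and both function names are made up for illustration. This is not how Cloud Monitoring computes anything, just the arithmetic I have in mind.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Window:
    """Request counts for one sampling window (hypothetical data)."""
    good: int   # requests with a 2xx response
    total: int  # all requests


def per_window_ratio(w: Window) -> Optional[float]:
    """Ratio for a single window only: what the chart seems to show."""
    return w.good / w.total if w.total else None


def cumulative_ratio(windows: List[Window]) -> Optional[float]:
    """Good / total summed over the whole period: what I expected."""
    good = sum(w.good for w in windows)
    total = sum(w.total for w in windows)
    return good / total if total else None


# Hypothetical sequence: good requests, then one bad request, then good again.
windows = [Window(good=3, total=3), Window(good=4, total=5), Window(good=2, total=2)]

per_window = [per_window_ratio(w) for w in windows]
overall = cumulative_ratio(windows)

print(per_window)  # one ratio per window
print(overall)     # single ratio over all windows
```

With these made-up counts the per-window values go 1.0, 0.8, 1.0, while the cumulative value stays at 0.9 after the bad request. The second number is the kind of SLI I expected to see over the 28-day period.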

Am I missing something? Could you give me a hand please?
