
Apigee latency SLO metric: which metric to use and how to set up the SLO?

Hi, I want to set up a 99th percentile latency SLO for an Apigee proxy.

There are 2 relevant metrics available:

  • apigee.googleapis.com/proxy/latencies (Kind: Delta, Value type: Distribution)
  • apigee.googleapis.com/proxyv2/latencies_percentile (Kind: Gauge, Value type: Double)

The `apigee.googleapis.com/proxy/latencies` metric offers a 99th percentile aggregation in Metrics Explorer, so I think it might be the correct one to use. But in the SLO setup there's no option to choose a 99th percentile aggregation, as shown in the image. It only has the option of specifying a range.

Screenshot 2024-11-25 at 2.40.57 PM.png

The other metric also feels right, but when I choose it as the SLI, the only aggregations available are sum and average.


I am not sure exactly how we can do this. I've been asking my peers to get some advice on this.


@minh-nguyen wrote:

The `apigee.googleapis.com/proxy/latencies` metric offers a 99th percentile aggregation in Metrics Explorer, so I think it might be the correct one to use. But in the SLO setup there's no option to choose a 99th percentile aggregation, as shown in the image. It only has the option of specifying a range.


I understand your question, and it sounds reasonable to me. The apigee.googleapis.com/proxy/latencies metric is the current one, and probably the one you want to use. But I see your point: I don't know how to select the p99 value for the SLI.

This feels to me like an issue that is not Apigee-specific. It may need to be addressed in the Cloud Monitoring suite. I'll let you know what I learn.

Following up on this. 

When you are setting an SLO (and the required SLI) in Google Cloud Monitoring (Observability), you should use apigee.googleapis.com/proxy/latencies. This is a distribution metric, that is, a histogram of latencies, much like a histogram in Prometheus. (Here is a good introduction to histograms in Prometheus: https://www.youtube.com/watch?v=yYbXak-1hew )

On the SLI step of the wizard, set the performance threshold to your latency target: 500 ms, 700 ms, whatever you like.

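Then, in the SLO section, specify the goal: the percentage of calls you would like to be handled below the specified time threshold. In your case you want p99, so you should set 99% there.

Cloud Monitoring will compute, from the histogram data, the fraction of calls handled under the threshold and manage the SLO accordingly. Keeping 99% of calls under the threshold is the same as keeping the 99th percentile under it.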
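For anyone who prefers to define this outside the wizard, here is a minimal sketch of what the same request-based SLO might look like as a request body for the Cloud Monitoring Service Monitoring API (a `distributionCut` SLI with a latency range, plus a goal). The helper name `build_latency_slo`, the 28-day rolling period, and the `resource.type` in the filter are my assumptions for illustration, and I am also assuming the metric is reported in milliseconds, so please check those against your own project before relying on it.

```python
import json


def build_latency_slo(threshold_ms: float, goal: float, rolling_days: int = 28) -> dict:
    """Build a request-based latency SLO body (hypothetical helper).

    threshold_ms: requests faster than this count as "good".
    goal: fraction of requests that must be good, e.g. 0.99 for p99.
    """
    if not 0 < goal < 1:
        raise ValueError("goal must be a fraction between 0 and 1")

    # Assumption: the monitored resource type for this metric in your project.
    metric_filter = (
        'metric.type="apigee.googleapis.com/proxy/latencies" '
        'resource.type="apigee.googleapis.com/Proxy"'
    )
    return {
        "displayName": f"{goal:.0%} of proxy requests under {threshold_ms:g} ms",
        "serviceLevelIndicator": {
            "requestBased": {
                "distributionCut": {
                    "distributionFilter": metric_filter,
                    # "Good" requests are those whose latency falls in this range.
                    "range": {"min": 0, "max": threshold_ms},
                }
            }
        },
        "goal": goal,
        "rollingPeriod": f"{rolling_days * 86400}s",
    }


if __name__ == "__main__":
    print(json.dumps(build_latency_slo(500, 0.99), indent=2))
```

The point to notice is that the percentile never appears as an aggregation: the range carries the latency target and `goal` carries the 99%.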


@minh-nguyen wrote:

The `apigee.googleapis.com/proxy/latencies` metric offers a 99th percentile aggregation in Metrics Explorer, so I think it might be the correct one to use.


Yes.  I think you are talking about this: 

dchiesa1_0-1732666924435.png

But the 99 / 95 / 50 / 5 options in that dropdown are ... contrived. There's nothing in the dataset that says "you must select one of these percentiles". The way the data is collected and tracked by Apigee should allow Cloud Monitoring to use any percentile: 98, 90, 75, whatever. It's just that the designers of the UI pre-selected those values.

Something similar happened with Alerting. The UI designers give you a selection of options. 

dchiesa1_1-1732667229959.png

But those are not the only values.

In the SLO/SLI, you have free rein to select the threshold you like, and Monitoring will do the computation based on that histogram data.
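To make "the computation based on that histogram data" concrete, here is a rough sketch of how a fraction of good requests can be derived from histogram buckets for any threshold. This is only an illustration of the idea, not Cloud Monitoring's actual implementation: the function name, the bucket layout, and the linear interpolation inside a bucket that straddles the threshold are all my assumptions.

```python
def good_fraction(bounds, counts, threshold):
    """Estimate the fraction of requests with latency below `threshold`.

    bounds: sorted upper edges of the finite buckets, e.g. [100, 250, 500, 1000].
    counts: one count per finite bucket plus a final overflow bucket,
            so len(counts) == len(bounds) + 1.
    Returns None when the histogram is empty.
    """
    if len(counts) != len(bounds) + 1:
        raise ValueError("counts must have one more entry than bounds")
    total = sum(counts)
    if total == 0:
        return None

    good = 0.0
    lower = 0.0
    # zip stops at the last finite bucket, so the overflow bucket is never "good".
    for upper, count in zip(bounds, counts):
        if threshold >= upper:
            good += count  # the whole bucket is under the threshold
        elif threshold > lower:
            # Assumption: requests are spread evenly inside the bucket.
            good += count * (threshold - lower) / (upper - lower)
        lower = upper
    return good / total


# Made-up sample data: 1000 requests spread over five buckets.
bounds = [100, 250, 500, 1000]
counts = [700, 200, 90, 8, 2]
print(good_fraction(bounds, counts, 500))
print(good_fraction(bounds, counts, 750))
```

With this made-up data, 990 of the 1000 requests sit in buckets below 500 ms, so a 500 ms threshold gives 0.99. Any other threshold works the same way, which is why the SLO is not limited to the percentiles listed in the dropdown.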

Does this answer your question?

Hi @dchiesa1, thank you so much for getting back to me. I have another question.

According to this article, https://cloud.google.com/stackdriver/docs/solutions/slo-monitoring/api/identifying-custom-sli , the `apigee.googleapis.com/proxy/latencies` metric is only for request-based SLOs, because it's a DELTA metric.

If I use `apigee.googleapis.com/proxy/latencies` in a window-based SLO, I can see that the type of the SLO becomes both request-based and window-based. Is that a correct setup?

Screenshot 2024-11-27 at 11.52.18 AM.png

From my understanding, the SLI calculation will be the percentage of good requests during each 1-minute window. Then, over the whole SLO period, it takes the percentage of good windows.

By that logic, I think the setup should be OK?
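To spell out the two-level calculation I have in mind, here is a small sketch. The first function mirrors my understanding of a window-based SLI that wraps a request-based one (`windowsBased` with `goodTotalRatioThreshold` in the Service Monitoring API); the second simulates the arithmetic on made-up numbers. The function names, the resource type in the filter, the millisecond unit, and the choice to skip windows with no traffic are assumptions on my side, not confirmed behaviour.

```python
def build_window_sli(threshold_ms, ratio_threshold, window_seconds=60):
    """Window-based SLI whose per-window performance is a request-based ratio."""
    return {
        "windowsBased": {
            "windowPeriod": f"{window_seconds}s",
            "goodTotalRatioThreshold": {
                # A window is good when at least this fraction of requests is good.
                "threshold": ratio_threshold,
                "performance": {
                    "distributionCut": {
                        # Assumption: resource type used by this metric.
                        "distributionFilter": (
                            'metric.type="apigee.googleapis.com/proxy/latencies" '
                            'resource.type="apigee.googleapis.com/Proxy"'
                        ),
                        "range": {"min": 0, "max": threshold_ms},
                    }
                },
            },
        }
    }


def window_compliance(windows, ratio_threshold):
    """Fraction of good windows, given (good_requests, total_requests) per window.

    Assumption: windows without traffic are skipped rather than counted.
    Returns None when no window had traffic.
    """
    scored = [good / total >= ratio_threshold for good, total in windows if total > 0]
    if not scored:
        return None
    return sum(scored) / len(scored)


# Made-up sample: four 1-minute windows, one of them without traffic.
windows = [(100, 100), (99, 100), (95, 100), (0, 0)]
print(window_compliance(windows, 0.99))
```

In this sample two of the three windows with traffic reach the 99% ratio, so the window-based compliance is 2/3, and that value is what would be compared against the SLO goal over the whole period.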