Hi Everyone,
We are trying to set up alerts that trigger based on error percentage. Relative to the total traffic volume received on the API, we want to define a percentage threshold, and an alert should be triggered whenever a particular error scenario crosses that limit.
We have tried the rolling window "percentage change" and "rate" functions, but these do not meet our requirement.
If anyone has done a similar setup, could you please suggest an approach?
@dchiesa1 any suggestions?
Thank you!
Hi @nishucs667! We appreciate your participation in the community! Rest assured, we’re monitoring this thread - We also encourage fellow members to weigh in if they can offer guidance 🙂
@ssvaidyanathan the alerts_examples doc shows how to create alerts based on
But none of the examples show how to create an alert based on error rate percentage, e.g., 4xx responses exceeding 8% of calls for a given time period. And I think that was the original request.
Is there an example for how to configure an alert to trigger when the 4xx responses exceed a given threshold?
EDIT: I created a doc bug (internal ref b/382512092) asking for documentation on this.
Thanks @dchiesa1. You are correct. The actual ask was to trigger alerts when any specific error crosses a percentage (the limit defined during alert setup) of the total volume received for that API in a specific time interval.
For example: if an API receives 10k calls in 2 hrs, then I want to define an error percentage (let's say 5% for 4xx failures). When the 4xx errors exceed 5% of the total volume within those 2 hrs (that is, more than 500 of the 10k calls), an alert should be triggered.
Yes. Did you see my other reply, with a screencast, showing how to do this?
Yes @dchiesa1. I am trying to test this out and see if everything works. Thanks for helping out. 🙂
I recorded this screencast showing how to use proxy metrics to get alerts on 4xx errors.
Currently, in the Prometheus engine that powers Google Cloud Monitoring (the subsystem that manages alerts for Apigee hybrid and X), there is no support for comparison operators such as greater-than or less-than-or-equal that would apply to int labels. There is also no support for "casting" int labels to string values, which would allow us to use regex. (See this related issue.) As a result, we need to sum the error rates across the individual 4xx error codes to get the aggregate.
The approach I used relies on this PromQL to sum the error rates for the various response_code values:
sum(
  rate(apigee_googleapis_com:proxy_response_count{monitored_resource="apigee.googleapis.com/Proxy",response_code="400"}[5m]) or
  rate(apigee_googleapis_com:proxy_response_count{monitored_resource="apigee.googleapis.com/Proxy",response_code="429"}[5m]) or
  rate(apigee_googleapis_com:proxy_response_count{monitored_resource="apigee.googleapis.com/Proxy",response_code="418"}[5m]) or
  rate(apigee_googleapis_com:proxy_response_count{monitored_resource="apigee.googleapis.com/Proxy",response_code="403"}[5m])
)
/
sum(
  rate(
    apigee_googleapis_com:proxy_response_count{monitored_resource="apigee.googleapis.com/Proxy"}[5m]
  )
) * 100
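
The query above returns the 4xx share of traffic as a percentage. To use it as an alert condition, one option is to append a comparison against the threshold. The following is a minimal sketch, not a tested policy: the 5% threshold comes from the example earlier in this thread, the 5m window is carried over unchanged from the query above, and the list of response codes is still only the four shown there, so any other 4xx codes you care about would need their own `rate(...)` line.

```promql
# Sketch only: evaluates to a result (and so can fire an alert) when
# 4xx responses exceed 5% of all responses over the window.
# Assumptions: the 5 threshold and the [5m] window are illustrative values.
sum(
  rate(apigee_googleapis_com:proxy_response_count{monitored_resource="apigee.googleapis.com/Proxy",response_code="400"}[5m]) or
  rate(apigee_googleapis_com:proxy_response_count{monitored_resource="apigee.googleapis.com/Proxy",response_code="429"}[5m]) or
  rate(apigee_googleapis_com:proxy_response_count{monitored_resource="apigee.googleapis.com/Proxy",response_code="418"}[5m]) or
  rate(apigee_googleapis_com:proxy_response_count{monitored_resource="apigee.googleapis.com/Proxy",response_code="403"}[5m])
)
/
sum(
  rate(
    apigee_googleapis_com:proxy_response_count{monitored_resource="apigee.googleapis.com/Proxy"}[5m]
  )
) * 100 > 5
```

In PromQL, `*` and `/` bind more tightly than `>`, so the comparison applies to the whole percentage. Because both `sum(...)` calls aggregate without a grouping clause, this computes a single percentage across every proxy the query matches; to alert per API, you would likely need to add a label filter or a `sum by (...)` grouping on whichever label identifies the proxy in your metrics. A longer window such as the 2 hrs mentioned earlier would mean changing each `[5m]`, assuming the engine accepts that range.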