I'm trying to write some PromQL alert policies. Specifically, I'm trying to detect when a service goes down on a Windows VM. If the VM itself is off, I want to ignore the absence. I'm using the process cpu_time metric to determine whether the service is running, and I'm thinking of using cpu_utilization to tell whether the machine is on.
I have code that will narrow results down to machines that have metrics, but I have no idea how to single out the ones where the service is off, but the machine is on. The count for the service is not 0 when the service is off. The count is null and that fact seems to make this very difficult. Any suggestions?
Hi @TheZealot,
Welcome to the Google Cloud Community!
You could try these workarounds, from Felipe on Stack Overflow and grobie on Grafana’s GitHub, which convert null to zero in PromQL by appending or on() vector(0) to cover the case where data is missing from the time series.
For example, given this query:
sum( avg_over_time( custom_googleapis_com:CUSTOM_METRIC_NAME{<filtering expression>} [${__interval}]))
We can add:
sum( avg_over_time( custom_googleapis_com:CUSTOM_METRIC_NAME{<filtering expression>} [${__interval}]) or on() vector(0) )
The or on() vector(0) clause needs to be positioned carefully: the or operator only falls back to vector(0) when the expression on its left returns no series at that level of the query. You may need to experiment with the placement to get the correct output.
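on() is a vector matching keyword in PromQL. It restricts which labels are compared when the two sides of an operator are matched; with an empty label list, no labels are compared, so the label-less vector(0) is only added when the left-hand side is empty.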
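Applied to your case (service down, VM up), the same "fill the gap with zero" idea can be done per instance instead of with a single vector(0). This is only a rough sketch: it assumes the Ops Agent metrics are exposed to PromQL as agent_googleapis_com:processes_cpu_time and agent_googleapis_com:cpu_utilization, that the process label identifies the service, and that instance_id is the per-VM label. SERVICE_PROCESS_NAME is a placeholder. Please check these names in Metrics Explorer before relying on it.

```promql
# Assumed metric and label names; adjust to what your agent actually exports.
(
    # Left side: present (value >= 1) only for VMs where the service's
    # process is currently reporting CPU time.
    count by (instance_id) (
      agent_googleapis_com:processes_cpu_time{process="SERVICE_PROCESS_NAME"}
    )
  or on (instance_id)
    # Right side: a 0 for every VM that reports CPU utilization at all,
    # i.e. the VM is on. "or" only keeps these for VMs missing on the left.
    count by (instance_id) (agent_googleapis_com:cpu_utilization) * 0
)
# Keep only the filled-in zeros: VM is on, but the service has no data.
== 0
```

VMs that are switched off report neither metric, so they do not appear on either side and should not trigger the alert. Depending on how quickly the metrics go stale, you may still need to tune the alert's duration window.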
I hope this helped!