We recently started installing the new Ops Agent, and were prompted to set up some alerts, one of which was for Disk Usage (which was a great idea, since we filled a boot drive today and didn't know about it 😞 )
Unfortunately, we got spammed about 10 minutes later by the alert as it triggered on all of the "/dev/loop" devices, which are always 100% full.
I read in the docs that tmpfs devices would be ignored, but it seems loop devices were not considered.
So, I wanted to exclude loop devices right away, instead of trying to filter them from the dashboards (which is a pain). (Conversely, the only real disk we use is root, so we could just explicitly include that, I suppose)
Unfortunately, the docs about setting up the /etc/google-cloud-ops-agent/config.yaml are not terribly clear.
Can I even do this?
If so, what would be the format?
Pointers or samples would be appreciated.
Thanks!
Dion
FYI... have you reviewed the following instructions?
https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/configuration#default
It looks like you may be able to use the following format to exclude certain metrics:
processors:
  metrics_filter:
    type: exclude_metrics
    metrics_pattern: []
I did review that document, yes, but it appears to only offer a method to exclude an entire metric, e.g.:
agent.googleapis.com/processes/*
And this document:
https://cloud.google.com/monitoring/api/metrics_opsagent#agent-disk
Hints at stuff:
bytes_used (GA): Disk bytes used
Kind, type, unit: GAUGE, DOUBLE, By
Monitored resources: aws_ec2_instance, gce_instance
Description: Current number of disk bytes used by state. Summing the values of all states yields the total available disk space. Linux only. Sampled every 60 seconds.
Labels: device: Device name. state: Type of usage, one of [free, used, reserved].
But there's no clear example of using it (that I could find).
And, in those examples, it seems it's not giving a method to say "exclude the device", just "exclude the metric".
What am I missing?
Thanks,
Dion
I see... "exclude the device" is not supported yet in the current version of Ops Agent.
Oh, good! I'm not totally losing it.
Any suggestions on a workaround to accomplish this in the interim? These loop devices will *always* be 100% (can I send feedback somewhere to have them included with tmpfs to be ignored by default?)
Thanks,
Dion
This is a known feature request, so no need to send any feedback for now.
As for a workaround, this is not verified for your specific use case, but you may want to try "custom metrics": https://cloud.google.com/monitoring/custom-metrics/creating-metrics
Thanks
The same thing happened to me. I got the Ops Agent installed (finally; long story, it required a lot of trial and error and persistence), and I was interested in monitoring disk usage. I got alerts: VM disk utilization too high. But the only alerts were for /dev/loop devices, which exceeded the 95% threshold. That is expected, because apparently the /dev/loop devices are mount points attached to snapd services.
I edited the policy named "VM disk utilization too high". There are three fields: filter, comparator, value. For the first, open the drop-down and choose "device"; for the second, open the drop-down and choose "!has_string" (does not have the string); and for the value, type in "dev/loop". Incidents then open only for devices that reach 95% utilization AND whose name does not contain the string dev/loop. All of my incidents had been for devices that contained the string "dev/loop", so once the second condition was saved to the policy they were all automatically resolved and closed. I expect I'll now only get an alert if the persistent disk utilization reaches 95%.
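The effect of that extra filter can be sketched in Python. This only illustrates the matching logic described above and is not how Cloud Monitoring implements it; the device names and utilization values are made-up sample data, and should_alert is a hypothetical helper.

```python
# Hypothetical sample of (device label, percent used) pairs; not real data.
SAMPLES = [
    ("/dev/loop0", 100.0),
    ("/dev/loop1", 100.0),
    ("/dev/sda1", 42.0),
    ("/dev/sdb", 97.5),
]

THRESHOLD = 95.0                 # the policy's utilization threshold
EXCLUDED_SUBSTRING = "dev/loop"  # value typed into the "!has_string" filter


def should_alert(device: str, percent_used: float) -> bool:
    """Return True only for devices that pass the filter and exceed the threshold."""
    if EXCLUDED_SUBSTRING in device:
        # The "!has_string" filter drops this time series before the threshold check.
        return False
    return percent_used > THRESHOLD


alerting = [device for device, used in SAMPLES if should_alert(device, used)]
print(alerting)
```

With this sample data, only the non-loop device above the threshold is left to raise an incident.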
Hi,
Have you tried excluding /dev/loop this way? I do not have a /dev/loop device; however, I'm excluding devices in exactly this way in our alerting policies.
cheers,
DamianS
So my policy looks as follows:
{
  "name": "projects/um-monitoring-webapp-wordpress/alertPolicies/17650320439033901987",
  "displayName": "Crit disk usage",
  "documentation": {},
  "userLabels": {},
  "conditions": [
    {
      "name": "projects/um-monitoring-webapp-wordpress/alertPolicies/17650320439033901987/conditions/17650320439033904538",
      "displayName": "CRITICAL VM Instance - Disk utilization",
      "conditionThreshold": {
        "aggregations": [
          {
            "alignmentPeriod": "600s",
            "crossSeriesReducer": "REDUCE_MEAN",
            "groupByFields": [
              "metric.label.device"
            ],
            "perSeriesAligner": "ALIGN_MEAN"
          }
        ],
        "comparison": "COMPARISON_GT",
        "duration": "0s",
        "filter": "resource.type = \"gce_instance\" AND metric.type = \"agent.googleapis.com/disk/percent_used\" AND (metric.labels.device != monitoring.regex.full_match(\"/dev/loop\") AND metric.labels.state = \"used\")",
        "thresholdValue": 90,
        "trigger": {
          "count": 1
        }
      }
    }
  ],
  "alertStrategy": {
    "autoClose": "604800s"
  },
  "combiner": "OR",
  "enabled": true,
  "notificationChannels": [
    "projects/um-monitoring-webapp-wordpress/notificationChannels/13716784830169285425"
  ],
  "creationRecord": {
    "mutateTime": "2023-04-21T07:31:40.689443240Z",
    "mutatedBy": "damian."
  },
  "mutationRecord": {
    "mutateTime": "2023-04-21T07:31:40.689443240Z",
    "mutatedBy": "damian."
  }
}
Thanks! For me, I needed to use the !starts_with comparator instead.
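A possible reason, offered as a guess rather than a confirmed diagnosis: monitoring.regex.full_match has to match the entire label value, so the pattern "/dev/loop" would not match "/dev/loop0", whereas a prefix test does. The Python sketch below mimics both behaviours with the standard re module (Cloud Monitoring uses RE2, which should agree with re on patterns this simple); the device names are made up.

```python
import re

# Hypothetical device labels; real loop devices are typically numbered like this.
devices = ["/dev/loop0", "/dev/loop12", "/dev/sda1"]


def full_match(pattern: str, value: str) -> bool:
    """Mimic a full-string regex match: the whole value must match the pattern."""
    return re.fullmatch(pattern, value) is not None


# Exclusion by full match on "/dev/loop": nothing is excluded,
# because no label is exactly "/dev/loop".
kept_exact = [d for d in devices if not full_match(r"/dev/loop", d)]

# Exclusion by full match on "/dev/loop.*": the loop devices are excluded.
kept_wildcard = [d for d in devices if not full_match(r"/dev/loop.*", d)]

# Exclusion by prefix, analogous to the "!starts_with" comparator.
kept_prefix = [d for d in devices if not d.startswith("/dev/loop")]

print(kept_exact)
print(kept_wildcard)
print(kept_prefix)
```

Under that reading, either the !starts_with comparator with the value "/dev/loop" or a full_match pattern of "/dev/loop.*" should exclude the loop devices; the regex variant was not verified in this thread.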