Ops Agent - logging loop devices - how to stop?

dvansevenant · 11-12-2021 01:49 PM

We recently started installing the new Ops Agent, and were prompted to set up some alerts, one of which was for Disk Usage (which was a great idea, since we filled a boot drive today and didn't know about it 😞 )

Unfortunately, we got spammed about 10 minutes later by the alert as it triggered on all of the "/dev/loop" devices, which are always 100% full.

I read in the docs that tmpfs devices would be ignored, but it seems loop devices were not considered.

So, I wanted to exclude loop devices right away, instead of trying to filter them from the dashboards (which is a pain). (Conversely, the only real disk we use is root, so we could just explicitly include that, I suppose)

Unfortunately, the docs about setting up the /etc/google-cloud-ops-agent/config.yaml are not terribly clear.

Can I even do this?

If so, what would be the format?

Pointers or samples would be appreciated.

Thanks!

Dion

JunOps

FYI... have you reviewed the following instructions?

https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/configuration#default

Looks like you may be able to use the following format to exclude certain metrics:

processors:
    metrics_filter:
      type: exclude_metrics
      metrics_pattern: []

dvansevenant

I did review that document, yes, but it appears to only offer a method to exclude an entire metric, e.g.:

agent.googleapis.com/processes/*

And this document:

https://cloud.google.com/monitoring/api/metrics_opsagent#agent-disk

Hints at stuff:

bytes_used (GA): Disk bytes used
GAUGE, DOUBLE, By; aws_ec2_instance, gce_instance
Current number of disk bytes used by state. Summing the values of all states yields the total available disk space. Linux only. Sampled every 60 seconds.
device: Device name.
state: Type of usage, one of [free, used, reserved].

But there's no clear example of using it (that I could find).

And, in those examples, it seems it's not giving a method to say "exclude the device", just "exclude the metric".

What am I missing?

Thanks,

Dion

JunOps

I see... "exclude the device" is not supported yet in the current version of Ops Agent.

dvansevenant

Oh, good! I'm not totally losing it.

Any suggestions on a workaround to accomplish this in the interim? These loop devices will *always* be 100% (can I send feedback somewhere to have them included with tmpfs to be ignored by default?)

Thanks,

Dion

JunOps

This is a known feature request, so no need to send any feedback for now.

As for a workaround: I haven't verified this for your specific use case, but you may want to try "custom metrics": https://cloud.google.com/monitoring/custom-metrics/creating-metrics

Thanks

urbanx

The same thing happened to me. I got Ops Agent installed (finally; long story, it required a lot of trial and error and persistence), and I was interested in monitoring disk usage. I got alerts: VM disk utilization too high. But the only alerts were for /dev/loop devices that exceeded the 95% threshold, which is expected because /dev/loop devices are apparently mount points attached to snapd services.

I edited the policy named "VM disk utilization too high". The filter has three fields: filter, comparator, value. For the first, open the drop-down and choose "device"; for the second, choose "!has_string" (does not have the string); and for the value, type in "dev/loop". The policy then only opens incidents for devices that reach 95% utilization AND whose name doesn't contain the string dev/loop. All my incidents had been for devices at 95% whose names contained "dev/loop", so once I saved the second condition to the policy, all my incidents were automatically resolved and closed. I expect I'll now only get an alert if the persistent disk utilization reaches 95%.
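
For anyone who keeps alert policies as JSON instead of editing them in the console, the same condition can be expressed as a Monitoring filter string. This is only a minimal sketch: build_disk_filter is a hypothetical helper, and it assumes the console's "!has_string" option corresponds to a negated has_substring(...) in the filter language, which is worth confirming against a policy exported from your own project.

```python
def build_disk_filter(excluded_substring: str = "dev/loop") -> str:
    """Build a filter string that skips devices whose name contains a substring."""
    clauses = [
        'resource.type = "gce_instance"',
        'metric.type = "agent.googleapis.com/disk/percent_used"',
        'metric.labels.state = "used"',
        # Assumption: NOT ... has_substring(...) mirrors the console's "!has_string".
        f'NOT metric.labels.device = has_substring("{excluded_substring}")',
    ]
    return " AND ".join(clauses)


# Paste the printed string into the "filter" field of a conditionThreshold.
print(build_disk_filter())
```

Building the string from a list of clauses keeps each condition on its own line, which makes it easier to add or drop a device exclusion later.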

DamianS

Hi,

Have you tried to "exclude" /dev/loop this way? I don't have a /dev/loop device, but I'm excluding devices in exactly this way in our alerting policies.

cheers,
DamianS

DamianS

So my policy looks as follows:

{
  "name": "projects/um-monitoring-webapp-wordpress/alertPolicies/17650320439033901987",
  "displayName": "Crit disk usage",
  "documentation": {},
  "userLabels": {},
  "conditions": [
    {
      "name": "projects/um-monitoring-webapp-wordpress/alertPolicies/17650320439033901987/conditions/17650320439033904538",
      "displayName": "CRITICAL VM Instance - Disk utilization",
      "conditionThreshold": {
        "aggregations": [
          {
            "alignmentPeriod": "600s",
            "crossSeriesReducer": "REDUCE_MEAN",
            "groupByFields": [
              "metric.label.device"
            ],
            "perSeriesAligner": "ALIGN_MEAN"
          }
        ],
        "comparison": "COMPARISON_GT",
        "duration": "0s",
        "filter": "resource.type = \"gce_instance\" AND metric.type = \"agent.googleapis.com/disk/percent_used\" AND (metric.labels.device != monitoring.regex.full_match(\"/dev/loop\") AND metric.labels.state = \"used\")",
        "thresholdValue": 90,
        "trigger": {
          "count": 1
        }
      }
    }
  ],
  "alertStrategy": {
    "autoClose": "604800s"
  },
  "combiner": "OR",
  "enabled": true,
  "notificationChannels": [
    "projects/um-monitoring-webapp-wordpress/notificationChannels/13716784830169285425"
  ],
  "creationRecord": {
    "mutateTime": "2023-04-21T07:31:40.689443240Z",
    "mutatedBy": "damian."
  },
  "mutationRecord": {
    "mutateTime": "2023-04-21T07:31:40.689443240Z",
    "mutatedBy": "damian."
  }
}

vitvitvit

Thanks! For me, I needed to use the !starts_with comparator instead.
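
A likely reason the regex version does not help here: a full match requires the pattern to cover the whole device name, so "/dev/loop" does not match "/dev/loop0". The sketch below illustrates the difference locally, using Python's re.fullmatch and str.startswith as stand-ins for the Monitoring comparators. That is an assumption for illustration only (Cloud Monitoring evaluates its own regex syntax server-side), and the device names are made-up examples.

```python
import re

# Made-up device names for illustration.
devices = ["/dev/sda1", "/dev/loop0", "/dev/loop12", "/dev/loop"]


def kept_by_full_match(device: str, pattern: str = "/dev/loop") -> bool:
    """Mimic '!= full_match(pattern)': keep names the pattern does not fully match."""
    return re.fullmatch(pattern, device) is None


def kept_by_starts_with(device: str, prefix: str = "/dev/loop") -> bool:
    """Mimic '!starts_with(prefix)': keep names that do not begin with the prefix."""
    return not device.startswith(prefix)


full_match_kept = [d for d in devices if kept_by_full_match(d)]
starts_with_kept = [d for d in devices if kept_by_starts_with(d)]

print(full_match_kept)   # ['/dev/sda1', '/dev/loop0', '/dev/loop12']
print(starts_with_kept)  # ['/dev/sda1']
```

With the full-match pattern, the numbered loop devices still pass the filter and keep triggering the alert; the prefix check drops all of them. A pattern that also covers the trailing digits would presumably work with the regex comparator too, but I have not verified that.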