Hi
I have a policy in a number of projects that checks for physical disks that are over 90% used. Loop disks are filtered out.
On a weekly basis, some of the projects get open Incidents created for what appears to be no good reason. Some disks are pretty static, like the EFI partition, and are nowhere near the 90% threshold (e.g. around 5% used). These incidents won't automatically close for many days. I don't think any of these disks ever reach the 90% threshold. The closest is the root partition, which is normally around 50%; data is mounted on a second disk.
Now for the strange part: if I deliberately fill a disk to test the policy, it triggers fine and opens an incident, and when I clean the disk up again the incident automatically closes after a while. There is only one disk utilization policy in the project.
Could there be a misconfiguration in my policy?
Here is the JSON for the one disk utilization policy:
{
  "name": "projects/project-a/alertPolicies/123456789",
  "displayName": "project-a low-server-disk-alert",
  "documentation": {
    "content": "A disk of a server project-a has low free disk space.",
    "mimeType": "text/markdown"
  },
  "userLabels": {},
  "conditions": [
    {
      "name": "projects/project-a/alertPolicies/123456789/conditions/987654321",
      "displayName": "VM Instance - disk utilization",
      "conditionThreshold": {
        "aggregations": [
          {
            "alignmentPeriod": "900s",
            "perSeriesAligner": "ALIGN_MEAN"
          }
        ],
        "comparison": "COMPARISON_GT",
        "duration": "0s",
        "filter": "resource.type = \"gce_instance\" AND metric.type = \"agent.googleapis.com/disk/percent_used\" AND (metric.labels.device != starts_with(\"/dev/loop\") AND metric.labels.state = \"used\")",
        "thresholdValue": 90,
        "trigger": {
          "count": 1
        }
      }
    }
  ],
  "alertStrategy": {
    "notificationPrompts": [
      "OPENED"
    ]
  },
  "combiner": "OR",
  "enabled": true,
  "notificationChannels": [
    "projects/project-a/notificationChannels/5647382910"
  ],
  "creationRecord": {
    "mutateTime": "2024-12-02T08:55:57.129430088Z",
    "mutatedBy": "someadmin.iam.gserviceaccount.com"
  },
  "mutationRecord": {
    "mutateTime": "2025-04-22T14:54:57.919785586Z",
    "mutatedBy": "someotheradmin"
  },
  "severity": "WARNING"
}
Thanks
Hello @hyperrat,
When a false positive alert is raised in GCP, you can follow these steps to figure out why it happened and make adjustments to prevent it in the future:
Find the Triggered Incident:
Go to Monitoring > Alerting in the GCP Console. Look for the specific alert policy that caused the false positive. Open it, and then click on the Incident Summary to view details about what happened.
Check the Logs for Clues:
In the Incident Summary, you’ll find logs related to the alert. Focus on two key pieces of information:
terse_message: This is a quick summary of why the alert was triggered. It’s helpful for getting a high-level understanding of the issue.
verbose_message: This provides more detailed information about the metric values, conditions, and why the policy thought something was wrong.
Look at the Metric Data:
Verify the actual disk usage at the time the alert was triggered. Was the reported percent_used value accurate for the device named in the incident, or did something unexpected happen? If the values look wrong, there might be an issue with the monitoring agent or with the metric itself.
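One low-effort way to check is to query the raw time series for a window around the incident time via the Monitoring API's projects.timeSeries.list method (for example through its API Explorer panel), reusing the same filter your policy already has, and compare what comes back against df output on the VM. A sketch of that filter, copied from your policy:
resource.type = "gce_instance" AND
metric.type = "agent.googleapis.com/disk/percent_used" AND
metric.labels.state = "used" AND
metric.labels.device != starts_with("/dev/loop")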
Review the Policy Filter:
Make sure the filter in your policy is set up correctly. It should exclude irrelevant time series, such as loop devices or static partitions that rarely change. If you notice a pattern in the false positives (e.g., the same device keeps showing up), you can tweak the filter to ignore those specific cases, as sketched below.
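For instance, if one device label keeps firing falsely, the condition filter could be extended with an extra exclusion. This is only a sketch; "/dev/sda15" is a placeholder for whatever device the incidents actually name:
resource.type = "gce_instance" AND
metric.type = "agent.googleapis.com/disk/percent_used" AND
metric.labels.state = "used" AND
metric.labels.device != starts_with("/dev/loop") AND
metric.labels.device != "/dev/sda15"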
Update the Alert Policy:
Based on your findings, edit the alert policy JSON to fix the issue. Here are some changes you might consider (a condition sketch follows this list):
Exclude problematic disks by refining the filter conditions.
Increase the evaluation duration to prevent brief spikes from opening incidents. Your policy currently has "duration": "0s", so a single aligned data point above the threshold is enough to trigger; raising it to e.g. 300s or 600s requires the condition to stay violated for that long first.
Adjust the aggregation method if needed, like switching from ALIGN_MEAN to ALIGN_MAX.
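Put together, the conditionThreshold block might end up looking something like the sketch below. The 600s duration, the ALIGN_MAX aligner, and the extra "/dev/sda15" exclusion are illustrative values taken from the suggestions above, not recommendations specific to your workload:
"conditionThreshold": {
  "aggregations": [
    {
      "alignmentPeriod": "900s",
      "perSeriesAligner": "ALIGN_MAX"
    }
  ],
  "comparison": "COMPARISON_GT",
  "duration": "600s",
  "filter": "resource.type = \"gce_instance\" AND metric.type = \"agent.googleapis.com/disk/percent_used\" AND metric.labels.state = \"used\" AND metric.labels.device != starts_with(\"/dev/loop\") AND metric.labels.device != \"/dev/sda15\"",
  "thresholdValue": 90,
  "trigger": {
    "count": 1
  }
}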
Test the Updates:
Once you’ve made changes, test the policy to make sure it behaves as expected. You can simulate high disk usage to confirm that the alert triggers only when it should and resolves automatically when the issue is fixed.
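As an alternative to filling a disk, you can temporarily lower the threshold in the policy below the current usage of a disk you expect to match (for example your roughly 50% root partition), confirm that an incident opens, then restore the value to 90 and check that the incident closes again. A hypothetical one-field edit, where 40 is just an example value below the root partition's usage:
"thresholdValue": 40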