Kubernetes cluster has overcommitted CPU resource requests

Hello,

Every night, around always the same time (2AM/3AM UTC), our K8s cluster triggers a DataDog alert of overcommitted CPU resources. Funny thing, during that time the data load processing is as low as possible and there is no evidence of high peak load.

The DD alert is based on:

avg(last_10m):sum:kubernetes_state.container.cpu_requested{project:infinity} by {cluster-name} / sum:kubernetes_state.node.cpu_capacity{project:infinity} by {cluster-name} > 1

Any idea about this behaviour?

Thanks

4 1 61
1 REPLY 1

Hi @Mett6464 ,

Welcome on Google Cloud Community. Did you've checked if you would be able to configure similar metric and alert policy within Stackdriver or PromQL? It looks like ( this is my assumption ) that DataDog Node agent sending false positive alerts. I had similar issue with DD and Openshift on IBM Cloud. I've also got few times false positive without any reason. I've changed monitoring to Sysdig with PromQL queries and since that time, I don't have any issues.  I'll try to reproduce this at my environment and let you know. 

cheers,
DamianS