Re: gce memory monitoring

gcp888 · 07-15-2024 07:30 AM

I am trying to use an MQL query to create a monitoring dashboard for a list of VMs. I was able to get the info I needed for CPU, but for memory I only see

compute.googleapis.com/instance/memory/balloon/ram_size

ram_used

swap_in_bytes_count

swap_out_bytes_count

Any idea how I can get actual memory utilization on the VM?

alexmoore

Hi

This is because, out of the box, the metrics provided are what the hypervisor sees. If you want in-guest metrics from within the virtual machine, you'll need to deploy the Google Cloud Ops Agent:

https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent

https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/installation

This will then provide a wealth of additional metrics from within the operating system, including memory:

https://cloud.google.com/monitoring/api/metrics_opsagent#agent-memory

As you'll also see on the above page, it supports a range of application integrations for even more insights.

Hope that helps,

Alex

gcp888

Hi, I tried, but the agent says:

Pending: Ops Agent is installing.

google-cloud-ops-agent.service - Google Cloud Ops Agent
Loaded: loaded (/usr/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
Active: active (exited) since Mon 2024-07-15 21:51:03 UTC; 9s ago
Process: 1772 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
Process: 1765 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -in /etc/google-cloud-ops-agent/config.yaml (code=exited, status=0/SUCCESS)
Main PID: 1772 (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 21983)
Memory: 0B
CGroup: /system.slice/google-cloud-ops-agent.service

Jul 15 21:50:57 test google_cloud_ops_agent_engine[1765]: pipelines:
Jul 15 21:50:57 test google_cloud_ops_agent_engine[1765]: default_pipeline:
Jul 15 21:50:57 test google_cloud_ops_agent_engine[1765]: receivers: [hostmetrics]
Jul 15 21:50:57 test google_cloud_ops_agent_engine[1765]: processors: [metrics_filter]
Jul 15 21:51:03 test google_cloud_ops_agent_engine[1765]: 2024/07/15 21:51:03 [Ports Check] Result: PASS
Jul 15 21:51:03 test google_cloud_ops_agent_engine[1765]: 2024/07/15 21:51:03 [Network Check] Result: PASS
Jul 15 21:51:03 test google_cloud_ops_agent_engine[1765]: 2024/07/15 21:51:03 [API Check] Result: FAIL, Error code: MonApiUnauthenticatedErr, Failure: The current VM couldn't >
Jul 15 21:51:03 test google_cloud_ops_agent_engine[1765]: 2024/07/15 21:51:03 [API Check] Result: FAIL, Error code: LogApiUnauthenticatedErr, Failure: The current VM couldn't >
Jul 15 21:51:03 test google_cloud_ops_agent_engine[1765]: 2024/07/15 21:51:03 Startup checks finished
Jul 15 21:51:03 test systemd[1]: Started Google Cloud Ops Agent.

● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
Loaded: loaded (/usr/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service.d
└─directories.conf
Active: active (running) since Mon 2024-07-15 21:51:03 UTC; 9s ago
Process: 1783 ExecStartPre=/bin/mkdir -p ${RUNTIME_DIRECTORY} ${STATE_DIRECTORY} ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
Process: 1774 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=otel -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY>
Main PID: 1784 (otelopscol)
Tasks: 6 (limit: 21983)
Memory: 32.6M
CGroup: /system.slice/google-cloud-ops-agent-opentelemetry-collector.service
└─1784 /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --config=/run/google-cloud-ops-agent-opentelemetry-collector/otel.yaml

Jul 15 21:51:04 test otelopscol[1784]: 2024-07-15T21:51:04.129Z info prometheusreceiver@v0.102.0/metrics_receiver.go:257 Scrape job added {"jobName>
Jul 15 21:51:04 test otelopscol[1784]: 2024-07-15T21:51:04.129Z info service@v0.102.0/service.go:206 Everything is ready. Begin running and processing dat>
Jul 15 21:51:04 test otelopscol[1784]: 2024-07-15T21:51:04.130Z info prometheusreceiver@v0.102.0/metrics_receiver.go:344 Starting scrape manager
Jul 15 21:51:05 test otelopscol[1784]: 2024-07-15T21:51:05.181Z error exporterhelper/queue_sender.go:101 Exporting failed. Dropping data. {"error":>
Jul 15 21:51:05 test otelopscol[1784]: go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
Jul 15 21:51:05 test otelopscol[1784]: /root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/exporterhelper/queue_sender.go:101
Jul 15 21:51:05 test otelopscol[1784]: go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
Jul 15 21:51:05 test otelopscol[1784]: /root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/internal/queue/bounded_memory_queue.go:52
Jul 15 21:51:05 test otelopscol[1784]: go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
Jul 15 21:51:05 test otelopscol[1784]: /root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.102.0/internal/queue/consumers.go:43

● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
Loaded: loaded (/usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service.d
└─directories.conf
Active: active (running) since Mon 2024-07-15 21:51:04 UTC; 9s ago
Process: 1792 ExecStartPre=/bin/mkdir -p ${RUNTIME_DIRECTORY} ${STATE_DIRECTORY} ${LOGS_DIRECTORY}/subagents (code=exited, status=0/SUCCESS)
Process: 1773 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRE>
Main PID: 1793 (google_cloud_op)
Tasks: 29 (limit: 21983)
Memory: 31.3M
CGroup: /system.slice/google-cloud-ops-agent-fluent-bit.service
├─1793 /opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_wrapper -config_path /etc/google-cloud-ops-agent/config.yaml -log_path /var/log/google-cloud-ops-a>
└─1798 /opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config /run/google-cloud-ops-agent-fluent-bit/fluent_bit_main.conf --parser /run/google-clo>

Jul 15 21:51:03 test google_cloud_ops_agent_engine[1773]: processors:
Jul 15 21:51:03 test google_cloud_ops_agent_engine[1773]: metrics_filter:
Jul 15 21:51:03 test google_cloud_ops_agent_engine[1773]: type: exclude_metrics
Jul 15 21:51:03 test google_cloud_ops_agent_engine[1773]: metrics_pattern: []
Jul 15 21:51:03 test google_cloud_ops_agent_engine[1773]: service:
Jul 15 21:51:03 test google_cloud_ops_agent_engine[1773]: pipelines:
Jul 15 21:51:03 test google_cloud_ops_agent_engine[1773]: default_pipeline:
Jul 15 21:51:03 test google_cloud_ops_agent_engine[1773]: receivers: [hostmetrics]
Jul 15 21:51:03 test google_cloud_ops_agent_engine[1773]: processors: [metrics_filter]
Jul 15 21:51:04 test systemd[1]: Started Google Cloud Ops Agent - Logging Agent.

● google-cloud-ops-agent-diagnostics.service - Google Cloud Ops Agent - Diagnostics
Loaded: loaded (/usr/lib/systemd/system/google-cloud-ops-agent-diagnostics.service; disabled; vendor preset: disabled)
Active: active (running) since Mon 2024-07-15 21:50:53 UTC; 20s ago
Main PID: 1757 (google_cloud_op)
Tasks: 6 (limit: 21983)
Memory: 21.8M
CGroup: /system.slice/google-cloud-ops-agent-diagnostics.service
└─1757 /opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_diagnostics -config /etc/google-cloud-ops-agent/config.yaml

Jul 15 21:50:53 test systemd[1]: google-cloud-ops-agent-diagnostics.service: Succeeded.
Jul 15 21:50:53 test systemd[1]: Stopped Google Cloud Ops Agent - Diagnostics.
Jul 15 21:50:53 test systemd[1]: Started Google Cloud Ops Agent - Diagnostics.
Jul 15 21:51:03 test google_cloud_ops_agent_diagnostics[1757]: 2024/07/15 21:51:03 rpc error: code = Unauthenticated desc = transport: per-RPC creds failed due to error: metad>
Jul 15 21:51:03 test google_cloud_ops_agent_diagnostics[1757]: 2024/07/15 21:51:03 rpc error: code = Unauthenticated desc = transport: per-RPC cr

alexmoore

My assumption from the log is that the VM doesn't have permission to write telemetry data. Have a read through this page for more details on this point:

https://cloud.google.com/monitoring/agent/ops-agent/authorization
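
If it helps to see only the failures in a long status dump like the one above, here is a minimal Python sketch that pulls out the startup checks reported as FAIL. The function name `failed_checks` and the regular expression are illustrative assumptions based on the log lines posted in this thread, not part of the Ops Agent itself.

```python
import re

# Matches startup-check lines of the shape seen in the posted status output:
#   [API Check] Result: FAIL, Error code: MonApiUnauthenticatedErr, Failure: ...
# The pattern is an assumption derived from those lines only.
CHECK_RE = re.compile(
    r"\[(?P<check>[^\]]+)\] Result: (?P<result>PASS|FAIL)"
    r"(?:, Error code: (?P<code>\w+))?"
)


def failed_checks(status_text):
    """Return (check name, error code) pairs for every FAIL line."""
    failures = []
    for line in status_text.splitlines():
        match = CHECK_RE.search(line)
        if match and match.group("result") == "FAIL":
            failures.append((match.group("check"), match.group("code")))
    return failures


# Lines copied from the status output above (truncation markers left as-is).
sample = """\
Jul 15 21:51:03 test google_cloud_ops_agent_engine[1765]: 2024/07/15 21:51:03 [Ports Check] Result: PASS
Jul 15 21:51:03 test google_cloud_ops_agent_engine[1765]: 2024/07/15 21:51:03 [Network Check] Result: PASS
Jul 15 21:51:03 test google_cloud_ops_agent_engine[1765]: 2024/07/15 21:51:03 [API Check] Result: FAIL, Error code: MonApiUnauthenticatedErr, Failure: The current VM couldn't >
Jul 15 21:51:03 test google_cloud_ops_agent_engine[1765]: 2024/07/15 21:51:03 [API Check] Result: FAIL, Error code: LogApiUnauthenticatedErr, Failure: The current VM couldn't >
"""

print(failed_checks(sample))
# [('API Check', 'MonApiUnauthenticatedErr'), ('API Check', 'LogApiUnauthenticatedErr')]
```

Both failures here are on the API check with "Unauthenticated" error codes for Monitoring and Logging, which is why the authorization page is the place to start.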

gcp888

I have the Ops Agent installed, but from my monitoring project, when I run an MQL query on the instance, I am still seeing the same stats:

compute.googleapis.com/instance/memory/balloon/ram_size

ram_used

swap_in_bytes_count

swap_out_bytes_count

Any idea how to get memory used?

alexmoore

Have a look under the agent metrics, check out:

https://cloud.google.com/monitoring/api/metrics_opsagent#agent-memory

gcp888

I tried doing an MQL query, but I don't see a filter for resource.instance_name,

only resource.instance_id, which makes it hard to identify which server I am working with.

alexmoore

You can do something like:

fetch gce_instance
| metric 'agent.googleapis.com/memory/percent_used'
| filter (resource.project_id == 'my_project_id')
| filter (metadata.system_labels.name == 'my_instance_name')
| filter (resource.zone == 'my_zone')
| filter (metric.state == 'used')
| group_by 2m, [value_percent_used_mean: mean(value.percent_used)]
| every 2m
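
Since the goal is a dashboard for a list of VMs, the same query can be generated for several instance names at once. The following is a minimal Python sketch, not an official tool: the helper name `build_memory_query` and its parameters are hypothetical, and it assumes MQL's `=~` regular-expression match can be applied to `metadata.system_labels.name` in the same way `==` is used above. The zone filter is left out so that VMs in different zones can be matched.

```python
import re

# Compute Engine instance names use lowercase letters, digits, and hyphens,
# so a valid name contains no regex metacharacters and needs no escaping.
NAME_RE = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")


def build_memory_query(project_id, instance_names, window="2m"):
    """Build an MQL query string for percent memory used on several VMs."""
    if not instance_names:
        raise ValueError("need at least one instance name")
    for name in instance_names:
        if not NAME_RE.fullmatch(name):
            raise ValueError(f"unexpected instance name: {name!r}")
    # Join the names into an alternation for the =~ filter.
    pattern = "|".join(instance_names)
    return "\n".join([
        "fetch gce_instance",
        "| metric 'agent.googleapis.com/memory/percent_used'",
        f"| filter (resource.project_id == '{project_id}')",
        f"| filter (metadata.system_labels.name =~ '{pattern}')",
        "| filter (metric.state == 'used')",
        f"| group_by {window}, "
        "[value_percent_used_mean: mean(value.percent_used)]",
        f"| every {window}",
    ])


# Hypothetical project ID and instance names, for illustration only.
print(build_memory_query("my-project-id", ["vm-a", "vm-b"]))
```

The printed text can be pasted into the MQL editor. The agent metric only has data from the point where the Ops Agent starts exporting successfully.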

gcp888

I tried this, but it says no data is available for the selected time frame.

alexmoore

What have you tried? Have you confirmed there was data?