Welcome to part two of this series where I’m excited to demonstrate how to implement proactive health checks to validate that your logging, search, detection, and alerting capabilities with Google Security Operations (SecOps) are working end-to-end.
In part one, I explained the importance of monitoring a data pipeline for issues, described the basic components that make up a security data pipeline, and introduced techniques for monitoring your pipeline using Google SecOps and Google Cloud. This blog post will take things a step further and show you how to validate that your data is flowing reliably and your defenses are always ready to detect & respond to threats.
A few days after implementing these health checks in my environment, I woke up to a Slack message informing me that there was a problem with my data pipeline. While I was asleep, there was a ~4 hour period where GitHub Enterprise logs weren’t being ingested into Google SecOps. None of my detection rules were being fed any logs and, as a result, they wouldn’t have alerted me if anything suspicious had happened.
With the help of a teammate, we determined that Google SecOps was successfully reading the contents of the Cloud Storage bucket where the GitHub Enterprise logs were supposed to be found, but there were no log files to ingest and Google SecOps assumed all was well. After all, not every system in an environment is logging hundreds or thousands of events an hour. Some systems are just quieter than others at certain times of the day. This is one of the reasons why detections that alert when a SIEM hasn’t received logs from a system for a certain amount of time can generate false positives that waste precious time.
The root cause of the logging disruption was a third-party incident, unrelated to Google SecOps or Google Cloud. Once the incident was resolved, we verified that the missing logs made their way to Google SecOps for ingestion and my GitHub rules generated the expected alerts retroactively.
This issue highlighted how external factors can disrupt your pipeline and create unexpected blind spots. By monitoring the components of your data pipeline, you can identify issues early on and minimize the potential impact on your security operations capabilities.
For this proof of concept, I’m going to focus on monitoring the health of my logging, search, detection, and alerting capabilities in Google SecOps for my GitHub Enterprise environment. I’m ingesting logs from my GitHub Enterprise environment into Google SecOps as follows:
Ingesting GitHub Enterprise audit logs into Google SecOps
If you’re interested in getting started with monitoring & detection for GitHub Enterprise using Google SecOps, please check out this blog post.
During the remainder of this post, I’ll walk through an example collection of health checks that carry out the following actions to validate my GitHub Enterprise monitoring capabilities with Google SecOps.
Data pipeline health checks
Finally, we’ll create a SOAR playbook that closes any alerts generated by the health check activity.
For this project, I opted to host the health checks that I’ll run on a regular basis in Cloud Run functions. This option is low maintenance, meaning that I can run some Python code without worrying about the underlying infrastructure. It’s also cost effective; I only get billed for my function’s execution time.
This first health check is simple – think of it as a “ping” to ensure that my code can authenticate to GitHub’s API and carry out a basic read operation via an API call. This API call should generate an event in GitHub’s audit logs, which should make it to Google SecOps for ingestion.
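To make this concrete, here’s a minimal sketch of what such a ping function could look like, assuming an HTTP-triggered Cloud Run function, a GitHub Personal Access Token and organization name supplied via environment variables, and the GitHub REST API’s GET /orgs/{org} endpoint. The function name and error handling are illustrative assumptions, not my exact implementation:

# A minimal sketch of the GitHub "ping" health check. The function name,
# environment variables, and error handling are illustrative assumptions.
import os

import functions_framework
import requests

GITHUB_API = "https://api.github.com"


@functions_framework.http
def health_check_github_ping(request):
    """Authenticate to GitHub's API and perform a basic read operation.

    The GET request below is recorded as an "api.request" event in the GitHub
    Enterprise audit log, which should then be ingested by Google SecOps.
    """
    org = os.environ["GITHUB_ORG"]      # the GitHub organization to read
    token = os.environ["GITHUB_TOKEN"]  # the Personal Access Token used for health checks
    response = requests.get(
        f"{GITHUB_API}/orgs/{org}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    # A non-2xx response raises an exception here; the resulting ERROR log entry
    # is what the Cloud Monitoring alerting policy described later notifies on.
    response.raise_for_status()
    return f"Health check OK: retrieved organization {response.json()['login']}", 200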
The screenshot below shows the output of this health check (the Cloud Run function) in Logs Explorer in the Google Cloud console. In this example, authentication to GitHub’s API was successful and my code was able to retrieve the information for one of my GitHub organizations via the API call.
I’ve configured Cloud Scheduler to run the Cloud Run function every hour at minute 0.
Reviewing the output for the “health-check-github-ping” Cloud Run function
The Python code for this health check can be found here.
The next health check is responsible for validating that the events generated by the first health check are indexed in Google SecOps and can be searched for.
The log entry below from this Cloud Run function shows that a UDM query was executed via Google SecOps’ API.
Executing a UDM search for the GitHub event via Google SecOps’ API
And the log entry below shows that the expected GitHub Enterprise event was returned by the UDM search.
Validating that the expected GitHub log event was returned by the UDM search
This Cloud Run function is scheduled to run every hour at minute 45, with the start time for the UDM search set to 1 hour ago.
The code for this health check can be found here.
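In condensed form, the core of this check looks roughly like the sketch below. The UDM query mirrors the fields used by the YARA-L rule shown later in this post, but the search endpoint, parameter names, and OAuth scope are assumptions based on the legacy Chronicle Search API rather than a copy of the actual code:

# A rough sketch of the UDM search validation check. The endpoint, parameter
# names, and OAuth scope are assumptions; adapt them to the SecOps API you use.
from datetime import datetime, timedelta, timezone

from google.auth.transport.requests import AuthorizedSession
from google.oauth2 import service_account

SCOPES = ["https://www.googleapis.com/auth/chronicle-backstory"]
UDM_SEARCH_URL = "https://backstory.googleapis.com/v1/events:udmSearch"  # assumption

# Match the audit log events generated by the first health check.
UDM_QUERY = (
    'metadata.log_type = "GITHUB" AND '
    'metadata.product_event_type = "api.request" AND '
    'network.http.method = "GET" AND '
    'extracted.fields["user_programmatic_access_name"] = "github-api-health-check"'
)


def validate_events_indexed(credentials_file: str) -> None:
    credentials = service_account.Credentials.from_service_account_file(
        credentials_file, scopes=SCOPES
    )
    session = AuthorizedSession(credentials)

    # The check runs at minute 45 and searches the preceding hour.
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=1)

    response = session.get(
        UDM_SEARCH_URL,
        params={
            "query": UDM_QUERY,
            "time_start": start_time.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "time_end": end_time.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "limit": 10,
        },
    )
    response.raise_for_status()
    events = response.json().get("events", [])

    if not events:
        # Log an error (and fail the function) so Cloud Monitoring can alert on it.
        raise RuntimeError("0 events returned from the UDM search for the GitHub health check")
    print(f"Health check OK: {len(events)} matching event(s) found in Google SecOps")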
Remember the logging disruption I mentioned at the beginning of this post? The example output below shows what it looks like when 0 events are returned from the UDM search.
Logging an error when 0 events are returned from the UDM search
I’ve configured a Cloud Monitoring policy to alert me via email and Slack when any of my health checks fail. The configuration of this policy is shown below.
Configuring a Cloud Monitoring policy to alert on health check job errors/failures
The third Cloud Run function validates that my health check YARA-L rule in Google SecOps generated a detection and an alert after the first health check (API call to GitHub) was executed. Alerts generated by this rule do not require a response – the purpose of this rule is simply to help validate that our logging, detection, and alerting capabilities are working end-to-end for GitHub.
A copy of this YARA-L rule is shown below. As you can see in the “events” section of this rule, we are searching for events related to my API call to GitHub and the specific name of the GitHub Personal Access Token I’m using to make the API call.
rule health_check_github_enterprise {

  meta:
    author = "Google Cloud Security"
    description = "Detects events that are generated by a health check job and helps determine if GitHub Enterprise audit logs are being indexed as expected."
    assumption = "GitHub Enterprise audit logs are being ingested into Google SecOps and a health check job has been configured to generate test events for this rule."
    tags = "health check"
    severity = "Info"
    priority = "Info"
    platform = "GitHub"
    data_source = "github"
    reference = "https://docs.github.com/en/enterprise-cloud@latest/admin/monitoring-activity-in-your-enterprise/reviewing-audit-logs-for-your-enterprise/audit-log-events-for-your-enterprise"

  events:
    $github.metadata.log_type = "GITHUB"
    $github.metadata.product_name = "GITHUB"
    $github.metadata.product_event_type = "api.request"
    $github.extracted.fields["org"] = "threatpunter1"
    $github.network.http.method = "GET"
    $github.extracted.fields["user_programmatic_access_name"] = "github-api-health-check"

  outcome:
    $risk_score = max(10)

  condition:
    $github
}
The output from the third health check shows that a detection and an alert were generated by my YARA-L rule. At this point, we’ve validated that our logging, search, detection, and alerting capabilities are working end-to-end for GitHub Enterprise with Google SecOps.
This Cloud Run function is also scheduled to run every hour at minute 45, with the start time for the detection and alert searches set to 1 hour ago.
Searching for detections and alerts generated by a rule using Google SecOps’ API
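For reference, the heart of this check can be sketched as follows. The detections endpoint, its parameters, and the rule ID are placeholders modeled on the legacy Chronicle Detection Engine API, not the exact implementation:

# A rough sketch of the detection/alert validation check. The endpoint,
# parameters, and rule ID below are placeholders/assumptions.
from datetime import datetime, timedelta, timezone

from google.auth.transport.requests import AuthorizedSession
from google.oauth2 import service_account

SCOPES = ["https://www.googleapis.com/auth/chronicle-backstory"]
RULE_ID = "ru_<health-check-rule-id>"  # ID of the health_check_github_enterprise rule


def validate_alert_generated(credentials_file: str) -> None:
    credentials = service_account.Credentials.from_service_account_file(
        credentials_file, scopes=SCOPES
    )
    session = AuthorizedSession(credentials)

    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=1)

    response = session.get(
        # Assumed ListDetections-style endpoint for the rule's detections.
        f"https://backstory.googleapis.com/v2/detect/rules/{RULE_ID}/detections",
        params={
            "start_time": start_time.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "end_time": end_time.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "alert_state": "ALERTING",  # only detections that also generated an alert
            "page_size": 10,
        },
    )
    response.raise_for_status()
    detections = response.json().get("detections", [])

    if not detections:
        # Log an error (and fail the function) so the security team is notified.
        raise RuntimeError("No detections/alerts were generated by the health check rule")
    print(f"Health check OK: {len(detections)} alerting detection(s) found")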
Below is an alert generated by the health check rule. Note that a human response is only required if any of our health checks fail. No response is required for this alert – we’ll take care of closing these alerts automatically in the next section of this post.
Reviewing an alert from the health check rule in Google SecOps
The log entries below show what it looks like when this health check fails (i.e., no detections were generated by the health check rule in Google SecOps). When any of these health checks fail, the security team will receive an alert to investigate the issue with their data pipeline and ensure it gets fixed quickly before their monitoring, detection, investigation, and other workflows are impacted.
Reviewing an error for the health check, “health-check-github-validate-alert-generation”
The code for this health check can be found here.
Action is only required from the security team if something is wrong with the data pipeline (i.e., logs aren’t flowing correctly or the health check rule isn’t being triggered). The SOAR playbook shown below automatically tags appropriate cases with “health-check” and closes alerts that are generated by the health check rule.
Creating a SOAR playbook to tag and close alerts generated by health check activity
Reviewing the case below shows that it was tagged by the SOAR playbook and that the attached alert has been closed.
Reviewing a case in SOAR that has been automatically tagged and closed
That’s it for this blog series, where I walked through some practical techniques for monitoring the health of your security data pipeline. This methodology can be expanded to cover other monitored systems & data sources to ensure that your defenses are always ready to detect & respond to threats.
Here’s a summary of what was covered: using Cloud Run functions and Cloud Scheduler to generate test activity in GitHub Enterprise on a regular schedule, validating via UDM search that the resulting events are indexed in Google SecOps, confirming that a health check YARA-L rule generates the expected detections and alerts, notifying the security team through Cloud Monitoring when any health check fails, and using a SOAR playbook to automatically tag and close the alerts produced by the health check activity.
Special thanks to the following people for sharing their valuable feedback and expertise: Dan Dye, Serhat Gülbetekin, Ermyas Haile, Dave Herrald, Utsav Lathia, Christopher Martin, Othmane Moustaid, and John Stoner.