Welcome to part two of this series where I’m excited to demonstrate how to implement proactive health checks to validate that your logging, search, detection, and alerting capabilities with Google Security Operations (SecOps) are working end-to-end.
In part one, I explained the importance of monitoring a data pipeline for issues, described the basic components that make up a security data pipeline, and introduced techniques for monitoring your pipeline using Google SecOps and Google Cloud. This blog post will take things a step further and show you how to validate that your data is flowing reliably and your defenses are always ready to detect & respond to threats.
A few days after implementing these health checks in my environment, I woke up to a Slack message informing me that there was a problem with my data pipeline. While I was asleep, there was a ~4 hour period where GitHub Enterprise logs weren’t being ingested into Google SecOps. None of my detection rules were being fed any logs and, as a result, they wouldn’t have alerted me if anything suspicious had happened.
With the help of a teammate, we determined that Google SecOps was successfully reading the contents of the Cloud Storage bucket where the GitHub Enterprise logs were supposed to be found, but there were no log files to ingest and Google SecOps assumed all was well. After all, not every system in an environment is logging hundreds or thousands of events an hour. Some systems are just quieter than others at certain times of the day. This is one of the reasons why detections that alert when a SIEM hasn’t received logs from a system for a certain amount of time can generate false positives that waste precious time.
The root cause of the logging disruption was a third-party incident, unrelated to Google SecOps or Google Cloud. Once the incident was resolved, we verified that the missing logs made their way to Google SecOps for ingestion and my GitHub rules generated the expected alerts retroactively.
This issue highlighted how external factors can disrupt your pipeline and create unexpected blind spots. By monitoring the components of your data pipeline, you can identify issues early on and minimize the potential impact on your security operations capabilities.
For this proof of concept, I’m going to focus on monitoring the health of my logging, search, detection, and alerting capabilities in Google SecOps for my GitHub Enterprise environment. I’m ingesting logs from my GitHub Enterprise environment into Google SecOps as follows:
Ingesting GitHub Enterprise audit logs into Google SecOps
If you’re interested in getting started with monitoring & detection for GitHub Enterprise using Google SecOps, please check out this blog post.
During the remainder of this post, I’ll walk through an example collection of health checks that carry out the following actions to validate my GitHub Enterprise monitoring capabilities with Google SecOps.
Data pipeline health checks
Finally, we’ll create a SOAR playbook that closes any alerts generated by the health check activity.
For this project, I opted to host the health checks that I’ll run on a regular basis in Cloud Run functions. This option is low maintenance, meaning that I can run some Python code without worrying about the underlying infrastructure. It’s also cost effective; I only get billed for my function’s execution time.
This first health check is simple – think of it as a “ping” to ensure that my code can authenticate to GitHub’s API and carry out a basic read operation via an API call. This API call should generate an event in GitHub’s audit logs, which should make it to Google SecOps for ingestion.
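To make this concrete, here’s a minimal sketch of what such a ping function could look like, assuming an HTTP-triggered Cloud Run function, a GitHub Personal Access Token and organization name supplied via environment variables, and the GitHub REST API’s GET /orgs/{org} endpoint. The function name and error handling are illustrative assumptions, not my exact implementation:

# A minimal sketch of the GitHub "ping" health check. The function name,
# environment variables, and error handling are illustrative assumptions.
import os

import functions_framework
import requests

GITHUB_API = "https://api.github.com"


@functions_framework.http
def health_check_github_ping(request):
    """Authenticate to GitHub's API and perform a basic read operation.

    The GET request below is recorded as an "api.request" event in the GitHub
    Enterprise audit log, which should then be ingested by Google SecOps.
    """
    org = os.environ["GITHUB_ORG"]      # the GitHub organization to read
    token = os.environ["GITHUB_TOKEN"]  # the Personal Access Token used for health checks
    response = requests.get(
        f"{GITHUB_API}/orgs/{org}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    # A non-2xx response raises an exception here; the resulting ERROR log entry
    # is what the Cloud Monitoring alerting policy described later notifies on.
    response.raise_for_status()
    return f"Health check OK: retrieved organization {response.json()['login']}", 200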
The screenshot below shows the output of this health check (the Cloud Run function) in Logs Explorer in the Google Cloud console. In this example, authentication to GitHub’s API was successful and my code was able to retrieve the information for one of my GitHub organizations via the API call.
I’ve configured Cloud Scheduler to run the Cloud Run function every hour at minute 0.
Reviewing the output for the “health-check-github-ping” Cloud Run function
The Python code for this health check can be found here.
The next health check is responsible for validating that the events generated by the first health check are indexed in Google SecOps and can be searched for.
The log entry below from this Cloud Run function shows that a UDM query was executed via Google SecOps’ API.
Executing a UDM search for the GitHub event via Google SecOps’ API
And the log entry below shows that the expected GitHub Enterprise event was returned by the UDM search.
Validating that the expected GitHub log event was returned by the UDM search
This Cloud Run function is scheduled to run every hour at minute 45, with the start time for the UDM search set to 1 hour ago.
The code for this health check can be found here.
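In condensed form, the core of this check looks roughly like the sketch below. The UDM query mirrors the fields used by the YARA-L rule shown later in this post, but the search endpoint, parameter names, and OAuth scope are assumptions based on the legacy Chronicle Search API rather than a copy of the actual code:

# A rough sketch of the UDM search validation check. The endpoint, parameter
# names, and OAuth scope are assumptions; adapt them to the SecOps API you use.
from datetime import datetime, timedelta, timezone

from google.auth.transport.requests import AuthorizedSession
from google.oauth2 import service_account

SCOPES = ["https://www.googleapis.com/auth/chronicle-backstory"]
UDM_SEARCH_URL = "https://backstory.googleapis.com/v1/events:udmSearch"  # assumption

# Match the audit log events generated by the first health check.
UDM_QUERY = (
    'metadata.log_type = "GITHUB" AND '
    'metadata.product_event_type = "api.request" AND '
    'network.http.method = "GET" AND '
    'extracted.fields["user_programmatic_access_name"] = "github-api-health-check"'
)


def validate_events_indexed(credentials_file: str) -> None:
    credentials = service_account.Credentials.from_service_account_file(
        credentials_file, scopes=SCOPES
    )
    session = AuthorizedSession(credentials)

    # The check runs at minute 45 and searches the preceding hour.
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=1)

    response = session.get(
        UDM_SEARCH_URL,
        params={
            "query": UDM_QUERY,
            "time_start": start_time.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "time_end": end_time.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "limit": 10,
        },
    )
    response.raise_for_status()
    events = response.json().get("events", [])

    if not events:
        # Log an error (and fail the function) so Cloud Monitoring can alert on it.
        raise RuntimeError("0 events returned from the UDM search for the GitHub health check")
    print(f"Health check OK: {len(events)} matching event(s) found in Google SecOps")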
Remember the logging disruption I mentioned at the beginning of this post? The example output below shows what it looks like when 0 events are returned from the UDM search.
Logging an error when 0 events are returned from the UDM search
I’ve configured a Cloud Monitoring policy to alert me via email and Slack when any of my health checks fail. The configuration of this policy is shown below.
Configuring a Cloud Monitoring policy to alert on health check job errors/failures
The third Cloud Run function validates that my health check YARA-L rule in Google SecOps generated a detection and an alert after the first health check (API call to GitHub) was executed. Alerts generated by this rule do not require a response – the purpose of this rule is simply to help validate that our logging, detection, and alerting capabilities are working end-to-end for GitHub.
A copy of this YARA-L rule is shown below. As you can see in the “events” section of this rule, we are searching for events related to my API call to GitHub and the specific name of the GitHub Personal Access Token I’m using to make the API call.
rule health_check_github_enterprise {

  meta:
    author = "Google Cloud Security"
    description = "Detects events that are generated by a health check job and helps determine if GitHub Enterprise audit logs are being indexed as expected."
    assumption = "GitHub Enterprise audit logs are being ingested into Google SecOps and a health check job has been configured to generate test events for this rule."
    tags = "health check"
    severity = "Info"
    priority = "Info"
    platform = "GitHub"
    data_source = "github"
    reference = "https://docs.github.com/en/enterprise-cloud@latest/admin/monitoring-activity-in-your-enterprise/reviewing-audit-logs-for-your-enterprise/audit-log-events-for-your-enterprise"

  events:
    $github.metadata.log_type = "GITHUB"
    $github.metadata.product_name = "GITHUB"
    $github.metadata.product_event_type = "api.request"
    $github.extracted.fields["org"] = "threatpunter1"
    $github.network.http.method = "GET"
    $github.extracted.fields["user_programmatic_access_name"] = "github-api-health-check"

  outcome:
    $risk_score = max(10)

  condition:
    $github
}
The output from the third health check shows that a detection and an alert were generated by my YARA-L rule. At this point, we’ve validated that our logging, search, detection, and alerting capabilities are working end-to-end for GitHub Enterprise with Google SecOps.
This Cloud Run function is also scheduled to run every hour at minute 45, with the start time for the detection and alert searches set to 1 hour ago.
Searching for detections and alerts generated by a rule using Google SecOps’ API
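For reference, the heart of this check can be sketched as follows. The detections endpoint, its parameters, and the rule ID are placeholders modeled on the legacy Chronicle Detection Engine API, not the exact implementation:

# A rough sketch of the detection/alert validation check. The endpoint,
# parameters, and rule ID below are placeholders/assumptions.
from datetime import datetime, timedelta, timezone

from google.auth.transport.requests import AuthorizedSession
from google.oauth2 import service_account

SCOPES = ["https://www.googleapis.com/auth/chronicle-backstory"]
RULE_ID = "ru_<health-check-rule-id>"  # ID of the health_check_github_enterprise rule


def validate_alert_generated(credentials_file: str) -> None:
    credentials = service_account.Credentials.from_service_account_file(
        credentials_file, scopes=SCOPES
    )
    session = AuthorizedSession(credentials)

    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=1)

    response = session.get(
        # Assumed ListDetections-style endpoint for the rule's detections.
        f"https://backstory.googleapis.com/v2/detect/rules/{RULE_ID}/detections",
        params={
            "start_time": start_time.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "end_time": end_time.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "alert_state": "ALERTING",  # only detections that also generated an alert
            "page_size": 10,
        },
    )
    response.raise_for_status()
    detections = response.json().get("detections", [])

    if not detections:
        # Log an error (and fail the function) so the security team is notified.
        raise RuntimeError("No detections/alerts were generated by the health check rule")
    print(f"Health check OK: {len(detections)} alerting detection(s) found")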
Below is an alert generated by the health check rule. Note that a human response is only required if any of our health checks fail. No response is required for this alert – we’ll take care of closing these alerts automatically in the next section of this post.
Reviewing an alert from the health check rule in Google SecOps
The log entries below show what it looks like when this health check fails (i.e., no detections were generated by the health check rule in Google SecOps). When any of these health checks fail, the security team will receive an alert to investigate the issue with their data pipeline and ensure it gets fixed quickly before their monitoring, detection, investigation, and other workflows are impacted.
Reviewing an error for the health check, “health-check-github-validate-alert-generation”
The code for this health check can be found here.
Action is only required from the security team if something is wrong with the data pipeline (i.e., logs aren’t flowing correctly or the health check rule isn’t being triggered). The SOAR playbook shown below automatically tags appropriate cases with “health-check” and closes alerts that are generated by the health check rule.
Creating a SOAR playbook to tag and close alerts generated by health check activity
Reviewing the case below shows that it was tagged by the SOAR playbook and that the attached alert has been closed.
Reviewing a case in SOAR that has been automatically tagged and closed
That’s it for this blog series, where I walked through some practical techniques for monitoring the health of your security data pipeline. This methodology can be expanded to cover other monitored systems & data sources to ensure that your defenses are always ready to detect & respond to threats.
Here’s a summary of what was covered: using Cloud Run functions and Cloud Scheduler to generate test activity in GitHub Enterprise on a regular schedule, validating via UDM search that the resulting events are indexed in Google SecOps, confirming that a health check YARA-L rule generates the expected detections and alerts, notifying the security team through Cloud Monitoring when any health check fails, and using a SOAR playbook to automatically tag and close the alerts produced by the health check activity.
Special thanks to the following people for sharing their valuable feedback and expertise: Dan Dye, Serhat Gülbetekin, Ermyas Haile, Dave Herrald, Utsav Lathia, Christopher Martin, Othmane Moustaid, and John Stoner.