New to Google SecOps: Turning Strings into Integers for Statistical Analysis

jstoner · 09-04-2024 08:30 AM

Today, I’d like to draw your attention to a couple of string functions that are helpful during threat hunting and when building detections. Both functions take string values and output integers, which can then be used on their own or in concert with other functions, including the statistical functions we discussed in our previous blog.

The two string functions we are going to cover are strings.count_substrings and strings.length. Let’s take a look at each one.

strings.count_substrings

There are times when we need to count the number of occurrences where a string pattern is present in a field or variable. A very basic example of this would be finding process launch events that have the pattern of c:\windows\ in the target.process.command_line field more than once.

metadata.event_type = "PROCESS_LAUNCH"
target.process.command_line !=""
strings.count_substrings(strings.to_lower(target.process.command_line), "c:\\windows\\") > 1

Notice that we nested strings.to_lower around the command line so that the field is lowercased before it is compared against our lowercase pattern. Because case sensitivity can be toggled on or off in search, having this function in place is a nice way to ensure that you get the results you expect regardless of that setting.

In the command line field of the results, we can see that rundll32.exe executes from the c:\windows\system32\ directory and calls a DLL in that same directory. The command line contains two substrings that match our pattern, so the event is returned.
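To reason about what the function returns before running a search, it can help to reproduce the count offline. The following is a minimal Python sketch, not the SecOps implementation: it assumes strings.count_substrings counts non-overlapping occurrences the way Python's str.count does, and the command line is a hypothetical value invented for illustration.

```python
# Hypothetical command line, invented for illustration only.
command_line = (
    r"C:\Windows\System32\rundll32.exe C:\Windows\System32\shell32.dll,Control_RunDLL"
)

# The escaped literal "c:\\windows\\" is the 11-character string c:\windows\
pattern = "c:\\windows\\"


def count_substrings(value: str, substring: str) -> int:
    """Count non-overlapping occurrences (stand-in for strings.count_substrings)."""
    return value.count(substring)


# Lowercasing first mirrors the nested strings.to_lower call.
lowered_count = count_substrings(command_line.lower(), pattern)

# Without lowercasing, the mixed-case path never matches the lowercase pattern.
raw_count = count_substrings(command_line, pattern)

print(lowered_count)  # 2
print(raw_count)      # 0
```

The second count shows why the lowercase conversion matters: a case-sensitive comparison against the raw field would miss both occurrences.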

Let’s take a look at another example. Perhaps we want to determine if a command switch, like /domain, is being called multiple times in a command line. If a set of commands are being strung together and they are using that same command switch, we want to see those events.

metadata.event_type = "PROCESS_LAUNCH"
target.process.command_line !=""
strings.count_substrings(strings.to_lower(target.process.command_line), "/domain") > 1

Our search looks very similar to the prior example; only the pattern has changed. In the results, we see that cmd.exe called a series of commands, including net user and net group commands, that used the /domain switch.

While we can use the function to narrow the result set, we can also apply aggregation to our search. Let’s group the result set by the command line and hostname and generate a count of the number of times /domain was seen in a command line, sorted from greatest to least.

metadata.event_type = "PROCESS_LAUNCH"
target.process.command_line != ""
target.process.command_line = $cmdline
principal.hostname = $hostname
match:
 $cmdline, $hostname
outcome:
 $string_count = max(strings.count_substrings(strings.to_lower(target.process.command_line), "/domain"))
order: $string_count desc
limit: 10

Based on the results, the host wrk-pacman has two command lines that contain this pattern multiple times.

Alternatively, we could group by the hostname and then generate a sum of the number of times that /domain was referenced during the timeframe of our search. Notice that we removed the command line from the match section. Because we want the total number of references rather than the number of events the switch is found in, we use the sum aggregation function instead of count.

metadata.event_type = "PROCESS_LAUNCH"
target.process.command_line !=""
target.process.command_line = $cmdline
principal.hostname = $hostname
match:
 $hostname
outcome:
 $string_sum = sum(strings.count_substrings(strings.to_lower(target.process.command_line), "/domain"))
order: $string_sum desc
limit: 10

Now we see that two systems have a number of /domain references in their process launch command lines, which gives us an opportunity to further focus our investigation.
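The difference between summing the occurrences and counting the events is easy to lose track of, so here is a small Python sketch of the two aggregations side by side. It is only an analog of the search above: the hostnames and command lines are hypothetical, and str.count stands in for strings.count_substrings.

```python
from collections import defaultdict

# Hypothetical (hostname, command line) pairs, invented for illustration.
events = [
    ("host-a", "cmd.exe /c net user /domain & net group /domain"),
    ("host-a", 'net group "Domain Admins" /DOMAIN'),
    ("host-b", "net user /domain"),
    ("host-b", "whoami"),
]

sum_per_host = defaultdict(int)     # analog of sum(strings.count_substrings(...))
events_per_host = defaultdict(int)  # number of events that contain the switch

for hostname, command_line in events:
    occurrences = command_line.lower().count("/domain")
    sum_per_host[hostname] += occurrences
    if occurrences > 0:
        events_per_host[hostname] += 1

# Sort from greatest to least, like "order: $string_sum desc".
ranked = sorted(sum_per_host.items(), key=lambda item: item[1], reverse=True)

print(ranked)
print(dict(events_per_host))
```

In this made-up data, host-a has three references spread across two events, which is exactly the distinction between the sum and the event count.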

We can also use the strings.count_substrings function in rules. In this example, I converted our last search into a rule and set a threshold to identify systems that see more than 15 references to the /domain command switch within a five minute window in my dataset. If an admin is issuing these commands as part of their normal work, this may be completely normal, but if we are seeing this frequency of command switches associated with another user or tool, this may be something we want to learn more about.

rule strings_count_substring_example {
 meta:
   author = "Google Cloud Security"
   description = "Identify large number of /domain substrings being issued"
   severity = "Low"
 events:
   $process.metadata.event_type = "PROCESS_LAUNCH"
   $process.target.process.command_line !=""
   $process.principal.hostname = $hostname
 match:
   $hostname over 5m
 outcome:
   $string_sum = sum(strings.count_substrings(strings.to_lower($process.target.process.command_line), "/domain"))
   $command_line = array_distinct($process.target.process.command_line)
 condition:
   $process and $string_sum > 15
}

When I tested the rule, a detection for the host wrk-pacman contained 23 references to the command switch /domain in a five minute window. We can also view an array of the distinct command lines from the events in that window, which lets us see where the pattern was counted.

As you can see, strings.count_substrings provides a method to search for a pattern within a field, and its output can be used to narrow the result set or be aggregated further.

strings.length

Much like strings.count_substrings, strings.length accepts a string and outputs an integer value that can then be used with additional statistical functions. The length function counts the number of characters in a specific field or variable. That’s it; it is pretty straightforward. This function can then be used to identify excessively long (or short) strings, like command lines, DNS queries or user agent strings, that might help us identify anomalous activities.

Here is a very basic search that returns events that have a command line length of more than 400 characters.

metadata.event_type = "PROCESS_LAUNCH"
target.process.command_line !=""
strings.length(target.process.command_line) > 400

Our results include two identical command line strings issued five minutes apart from one another.

This search can be broadened to filter on process launch events with a long command line string, but then aggregated and displayed with the command lines and the maximum length calculated per hostname.

metadata.event_type = "PROCESS_LAUNCH"
target.process.command_line !=""
strings.length(target.process.command_line) > 400
match:
 principal.hostname
outcome:
 $event_count = count(metadata.event_type)
 $cmd_length = max(strings.length(target.process.command_line))
 $command_line = array_distinct(target.process.command_line)
order:
 $cmd_length desc

Here we can see that the longest command line string on the host wrk-pacman is 450 characters and there are a total of four events during our search window that exceed 400 characters. Due to the length, it may be worthwhile looking into these longer command line events.
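As a rough offline analog of the filter and the per-host aggregation, here is a short Python sketch. The events are hypothetical and padded with synthetic characters, and Python's len is only assumed to behave like strings.length for these plain ASCII strings.

```python
from collections import defaultdict

THRESHOLD = 400  # same cutoff used in the searches above

# Hypothetical events; the command lines are synthetic padding, not real data.
events = [
    ("host-a", "powershell.exe -enc " + "A" * 430),
    ("host-a", "powershell.exe -enc " + "B" * 390),
    ("host-a", "cmd.exe /c dir"),
    ("host-b", "python.exe script.py " + "x" * 500),
]

per_host = defaultdict(lambda: {"event_count": 0, "cmd_length": 0})

for hostname, command_line in events:
    length = len(command_line)  # stand-in for strings.length(...)
    if length > THRESHOLD:      # stand-in for the "> 400" filter line
        stats = per_host[hostname]
        stats["event_count"] += 1
        stats["cmd_length"] = max(stats["cmd_length"], length)

# Sort by the longest command line, like "order: $cmd_length desc".
ranked = sorted(per_host.items(), key=lambda item: item[1]["cmd_length"], reverse=True)

for hostname, stats in ranked:
    print(hostname, stats)
```

Events at or below the threshold are dropped before aggregation, so the event count only reflects the long command lines, just as in the search.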

Let’s pivot to network-centric events. Perhaps we have a decent handle on the kinds of user agent strings we would expect to see on certain systems and we want to identify abnormal user agent strings. Our search aggregates all events by the event type for HTTP events and outputs the listing of user agents in an array, while also calculating the length of each user agent string and generating statistical calculations, including the minimum and maximum length, and the average, median and standard deviation of the length.

metadata.event_type = "NETWORK_HTTP"
network.http.user_agent != ""
$ip_pair = strings.concat(principal.ip, " | ", target.ip)
$ip_pair != /::1/
match:
 metadata.event_type
outcome:
 $event_count = count(metadata.event_type)
 $ua_max_length = max(strings.length(network.http.user_agent))
 $ua_min_length = min(strings.length(network.http.user_agent))
 $ua_avg_length = window.avg(strings.length(network.http.user_agent))
 $ua_median_length = window.median(strings.length(network.http.user_agent),false)
 $ua_std_length = window.stddev(strings.length(network.http.user_agent))
 $user_agent = array_distinct(network.http.user_agent)
order:
 $ua_max_length desc, $event_count desc

In our results, we can see the smallest user agent string is 3 characters, while the longest is 269. The median length across the entire data set is 14.
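To make the statistics concrete, the following Python sketch computes the same kinds of values over a handful of user agent strings. The list is hypothetical and much smaller than a real dataset, and the standard deviation here is the population form from Python's statistics module; which form window.stddev uses is not stated in this post, so treat that line as an assumption.

```python
import statistics

# Hypothetical user agent strings, chosen only to give a spread of lengths.
user_agents = [
    "curl/8.4.0",
    "Go-http-client/1.1",
    "abc",
    "Mozilla/5.0 (X11; Linux x86_64)",
    "python-requests/2.31.0",
]

# Stand-in for strings.length(network.http.user_agent) on every event.
lengths = [len(ua) for ua in user_agents]

ua_max_length = max(lengths)
ua_min_length = min(lengths)
ua_avg_length = statistics.mean(lengths)
ua_median_length = statistics.median(lengths)
ua_std_length = statistics.pstdev(lengths)  # assumption: population std dev

print(ua_min_length, ua_max_length, ua_median_length)  # 3 31 18
print(ua_avg_length, ua_std_length)
```

Comparing the minimum and maximum against the median and average is what surfaces the outliers: a three character user agent sits far below the middle of this small sample.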

Let’s refine our search a bit further and aggregate by IP address pairing to generate these same statistical calculations.

metadata.event_type = "NETWORK_HTTP"
network.http.user_agent != ""
$ip_pair = strings.concat(principal.ip, " | ", target.ip)
$ip_pair != /::1/
match:
 $ip_pair
outcome:
 $event_count = count(metadata.event_type)
 $ua_max_length = max(strings.length(network.http.user_agent))
 $ua_min_length = min(strings.length(network.http.user_agent))
 $ua_avg_length = window.avg(strings.length(network.http.user_agent))
 $ua_median_length = window.median(strings.length(network.http.user_agent),false)
 $ua_std_length = window.stddev(strings.length(network.http.user_agent))
 $user_agent = array_distinct(network.http.user_agent)
order:
 $ua_max_length desc, $event_count desc

Now we can see that we have some external scanners in our results, like Expanse, as well as other user agent strings that are a good deal larger than the median and average values we saw in our previous search. These user agents include references to iPhone and Nokia N9. Depending on the environment, user agents that are unexpected may be an interesting place to start hunting or investigating. Another technique could be to dig deeper into user agent strings that are excessively long.

In our case, I want to build a rule that focuses on events whose user agent string is shorter than the median length from our original search, so that we can learn a bit more about them.

rule strings_length_example {
 meta:
   author = "Google Cloud Security"
   description = "Identify user agent strings that are below the median of 14 and generate statistical values for additional consideration"
   severity = "Low"
 events:
   $net.metadata.event_type = "NETWORK_HTTP"
   $net.network.http.user_agent != ""
   strings.length($net.network.http.user_agent) < 14
   $net.target.ip != /::1/
   $net.principal.ip = $principal_ip
   $net.target.ip = $target_ip
 match:
   $principal_ip, $target_ip over 1h
 outcome:
   $event_count = count($net.metadata.event_type)
   $ua_max_length = max(strings.length($net.network.http.user_agent))
   $ua_min_length = min(strings.length($net.network.http.user_agent))
   $ua_avg_length = window.avg(strings.length($net.network.http.user_agent))
   $ua_median_length = window.median(strings.length($net.network.http.user_agent),false)
   $ua_std_length = window.stddev(strings.length($net.network.http.user_agent))
   $ua = array_distinct($net.network.http.user_agent)
 condition:
   $net
}

The rule looks a good bit like our search. I added a statement in the events section to focus the rule on user agent strings that are shorter than 14 characters, and I am aggregating the events by principal and target IP address over a one hour window.

In the results, we have a few IP address pairs with their statistical calculations as well as an array of the user agents that meet the event criteria. I could tune my rule further before turning the alert on.

Today we covered two string functions that take string values and convert them to integers, which can then be used with additional statistical functions, as well as used to narrow result sets. Hopefully the concepts shared today will help you apply these functions to your use cases with Google SecOps!