Stats table calcs: comparing rates over time (Bonu... - Page 2

fabio1 · 10-05-2017 09:20 PM

If you haven’t already read my post on using the beta distribution for an A/B test, that might be a good warm-up read. (The disclaimers from that post still apply 😉 )

In that example, the data took the form of a proportion - a number of successes out of a number of trials. What do you do if you want to instead compare rates of events over intervals? For example, user signups, error events, or web traffic over time.

Well, in the A/B testing example, the data points can be thought of as being taken from a Bernoulli distribution: either yes or no with a given probability. That probability is in turn modeled by the conjugate prior of the Bernoulli distributions, the beta distributions, so we used the beta_inv table calc.

In the present scenario, the data points can be thought of as being taken from a Poisson distribution: its possible values are whole positive numbers representing a number of events happening in an interval assuming those events occur independently at a given rate. That rate, which is what we are actually interested in quantifying, can be thought of as being taken from the Poissons’ conjugate priors: Gamma distributions.

For the sake of illustration, lets say you had data like this:

Month   | Days    | Errors
------------------------------------------
Jan     | 31      | 12
Feb     | 28      | 13
Mar     | 31      | 19

Is the rate of errors in March higher than in February? On the surface it looks that way, since the observed rate was 19/31=0.61 events per day, as compared to the observed rate of 13/28 = 0.46 events per day in February.

But with any observation, there is variability and we are interested in the “true” underlying rate that generated this data, moreso than the observed rate in our sample. The conjugate prior distribution allows us to understand exactly that - given a set of observations, what the true underlying rate may be.

With our example, when you use the conjugate prior (gamma distribution), you see there is quite a bit of range, including lots of overlap between what the rate of errors may be for March and for February:

Just like in the A/B testing example, you could use the inverse distribution function to come up with credibility intervals and then visualize them using the timeline viz:

The upper/lower bound formulas for 90% credibility intervals would be

gamma_inv(
    0.5 +/- 0.45,
    ${...events} + 0.001, 
    1/( ${...days} + 0.001)
)

Bonus: Anomaly Detection

Let’s try to detect when there is a drop in a rate of events. We’ll continue with the gamma distribution, but these concepts can just as easily be applied to detect a change in a proportion, with the beta distribution.

We’ll make a look that is designed to be run every day. We’ll have it pull some trailing activity by day, for example 60 days’ worth. Then we’ll add some formulas to detect when there is a significant drop:

(In this example, the threshold happens to be on a day with 0 events, but it is highlighting a difference in data before and after the line, not simply highlighting that day for being 0)

A practical consideration is choosing the windows of time to compare. There are many ways to do this, with different levels of finesse such as dynamically choosing baseline periods, normalizing for weekly or seasonal trends, or pushing the window functions into SQL instead of table calcs, but for demonstrative purposes, I’ll simply use table calcs to consider windows of increasing size looking back from the current day, and to compare those windows to similarly sized prior windows. Then I’ll highlight in red the first day (if any) with a significant, sizable drop. Here are the specific formulas I used:

N Periods: row()
2 N rows available?: row() <= count(${accidents.event_date}) / 2
^ We will use this to “Hide No’s” since we need half of the rows for comparison’s sake only
Trailing N Total: running_total(${accidents.count})
Prior N Total: sum(offset_list(${accidents.count},1,row()))
Trailing N Upper Bound: gamma_inv(0.95, ${trailing_n_total} + 0.001, 1 / (${n_periods} + 0.001))
Prior N Lower Bound: gamma_inv(0.05, ${prior_n_total} + 0.001, 1 / (${n_periods} + 0.001))
Is practical difference?: ${trailing_n_upper_bound} < ${prior_n_lower_bound} * 0.667
^ Normally you want to specify a ratio of “practical equivalence”, here a drop of over 1/3rd
As 0/1: if( ${is_practical_difference}, 1, 0)
Difference running total: running_total(${as_01})
Most Recent Deviation: if( ${difference_running_total}=1 AND offset(${difference_running_total},-1)=0, max(${accidents.count}), null)
^This will be used to highlight the deviation on the chart

Setting that last measure to a column allows us to get a “highlight”:

In conclusion, we can say (with high probability) that to the right of this line, the true underlying rate of events has dropped by at least a third, as compared to the same sized period immediately to the left of this line. And because we’re using Bayesian statistics, these formulas are robust against small datasets, meaning no manual exception handling or awkward sample size declarations up front.

Stats table calcs: comparing rates over time (Bonus: anomaly detection!)

Bonus: Anomaly Detection