This content, written by Scott Hoover, was initially posted in Looker Blog on Mar 27, 2014. The content is subject to limited support.
Recently, the Looker Analytics team was posed with a unique challenge by a client: come up with a way to exclude weekends from an analysis that calculates the duration between two times.
The data in question captures the rate at which jobs move from various stages in production. Each row of the table of interest (hereinafter velocity) represents a single job — which, along with other attributes, has an approved and a completed time.
velocity table
1 | 2013-07-31 04:00:00 | 2013-08-01 11:00:00 | 2013-08-04 13:00:00 | ... | 1 | 1 |
2 | 2013-03-07 07:00:00 | 2013-03-07 10:00:00 | 2013-03-12 14:00:00 | ... | 4 | 1 |
... | ... | ... | ... | ... | ... | ... |
n | 2013-05-07 09:00:00 | 2013-05-08 02:00:00 | 2013-05-16 12:00:00 | ... | 3 | 2 |
The first pass at addressing approved-to-complete velocity relied on a simple date difference:
SELECT
job_id
,
approved_at
,
published_at
,
EXTRACT
(
'EPOCH'
FROM
(
completed_at
-
approved_at
))
/
3600
AS
velocity_in_hours
FROM
velocity
The problem with this approach is that some jobs start in one week, extend over one or more weekends, and are completed in some subsequent week; however, work is not done on weekends, and the velocity for jobs that extend over one or more weekends is penalized in a sense. This is why Looker was asked to come up with a solution to exclude weekends from the analysis.
A few approaches were posed, but the final solution relied on a look-up table containing three columns: a date_time field, a day_of_week field, and an integer, hour_seq, indicating the number of hours since the first hour in this table (hereinafter sequence). The table would be used to get a corresponding integer (again, the number of hours since the beginning of the sequence table) for the approved_at and completed_at for each job and then subtract the two, yielding the number of hours between the two times.
The key to the solution was that the integers in the sequence table would not increment on weekend hours. The resulting table would look like this:
sequence table
2008-01-01 00:00:00 | 2 | 1 |
2008-01-01 01:00:00 | 2 | 2 |
2008-01-01 02:00:00 | 2 | 3 |
... | ... | ... |
2008-01-04 21:00:00 | 5 | 93 |
2008-01-04 22:00:00 | 5 | 94 |
2008-01-04 23:00:00 | 5 | 95 |
2008-01-05 00:00:00 | 6 | 95 |
2008-01-05 01:00:00 | 6 | 95 |
2008-01-05 02:00:00 | 6 | 95 |
... | ... | ... |
2008-01-06 23:00:00 | 0 | 95 |
2008-01-07 00:00:00 | 1 | 96 |
2008-01-07 01:00:00 | 1 | 97 |
To achieve this, we needed to start with a table of hours from the beginning of time (or 2008, when the business started) going well into the future. We relied on recursive common table expressions (CTEs) in PostgreSQL to generate a sequence of numbers and then add hours accordingly, but this could achieved in a number of ways for dialects without recursive CTEs.
WITH
RECURSIVE
seq
(
n
)
AS
(
SELECT
1
UNION
ALL
SELECT
n
+
1
FROM
seq
)
--generate a sequence of numbers
,
hours
AS
(
SELECT
TIMESTAMP
'2008-01-01 00:00:00'
+
n
*
INTERVAL
'1 hour'
AS
date_time
FROM
seq
LIMIT
87600
)
--generate ten years of hours from 2008-01-01 00:00:00 into the future
SELECT
date_time
,
EXTRACT
(
dow
FROM
date_time
)
AS
day_of_week
,
SUM
(
CASE
WHEN
EXTRACT
(
dow
FROM
date_time
)
NOT
IN
(
6
,
0
)
THEN
1
ELSE
NULL
END
)
OVER
(
ORDER
BY
date_time
)
AS
hour_seq
FROM
hours
The penultimate step was to twice join the sequence table into velocity and map each created_at and completed_at entry to its corresponding hour_seq. The final velocity table looked something like this:
velocity table
1 | 2013-07-31 04:00:00 | 2013-08-01 11:00:00 |
34979 | 2013-08-04 13:00:00 |
35015 | ... | 1 | 1 |
2 | 2013-03-07 07:00:00 |
2013-03-07 10:00:00 |
32458 | 2013-03-12 14:00:00 |
32534 | ... | 4 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
n | 2013-05-07 09:00:00 |
2013-05-08 02:00:00 |
33506 | 2013-05-16 12:00:00 |
33660 | ... | 3 | 2 |
With a velocity table that had corresponding hour sequence numbers for each approved and completed time, we could simply difference these two values to see the time spent in production, in hours. This could be easily extended to capture any timeframe.
SELECT
approved_at
,
completed_at
,
completed_int
-
approved_int
AS
duration_in_hours
FROM
velocity
— fin —