This content, written by Lloyd Tabb, was initially posted on the Looker Blog on Mar 29, 2016. The content is subject to limited support.
Last year, Google uploaded complete data about Hacker News posts to its BigQuery data engine. BigQuery is a giant clustered SQL engine in the cloud that can query enormous amounts of data very quickly. Felipe did some great analyses of the dataset, writing SQL by hand and graphing the results in Matlab.
Others on Hacker News picked up the thread and added their own analyses (meta, right?). I loved the idea of exploring this public dataset and wanted to surface some new insights, and also to make it accessible to anyone who wanted to play with the data, whether or not they knew SQL.
So I wrote up a quick model in LookML (Looker's YAML-based modeling language) to describe how the BigQuery tables relate to each other. (It's all of four files and fewer than 300 lines of code.)
Below, I'll walk through the process of building out the model and the analyses it makes easy (want to know which domain has the most stories on Hacker News? Or which words are most common? Or which author's stories score the highest? Read on!).
Navigating to the dataset, we see there are two tables, stories and comments. Both tables are relatively simple in structure.
Each story contains a story id, the author who made the post, when it was written, and the score the story achieved, which I believe corresponds to 'points' on the site. Each story also has a title and the URL of the content.
Let's start with stories. First, we run Looker's generator to create a LookML model for the stories table. Each field in the table gets an associated LookML dimension. These dimensions are used both in Looker's Explorer and to write SQL statements and run them in BigQuery.
Without changing any of this code, we can explore the data immediately.
explore: stories {}
view: stories {
sql_table_name: [fh-bigquery:hackernews.stories] ;;
measure: count {
type: count
drill_fields: [detail*]
}
dimension: id {
type: number
sql: ${TABLE}.id ;;
}
dimension: by {
type: string
sql: ${TABLE}.[by] ;;
}
dimension: score {
type: number
sql: ${TABLE}.score ;;
}
dimension: time {
type: number
sql: ${TABLE}.[time] ;;
}
dimension_group: post {
type: time
timeframes: [time, date, week, month, year]
sql: ${TABLE}.time_ts ;;
}
dimension: title {
type: string
sql: ${TABLE}.title ;;
}
dimension: url {
type: string
sql: ${TABLE}.url ;;
}
dimension: text {
type: string
sql: ${TABLE}.text ;;
}
dimension: deleted {
type: yesno
sql: ${TABLE}.deleted ;;
}
dimension: dead {
type: yesno
sql: ${TABLE}.dead ;;
}
dimension: descendants {
type: number
sql: ${TABLE}.descendants ;;
}
dimension: author {
type: string
sql: ${TABLE}.author ;;
}
set: detail {
fields: [id, post_time, author, title]
}
}
The first and most obvious question is, 'How many stories are there?'
Is the number of stories going up or down? Let's look by year.
Looks like stories peaked in 2013.
I've tried to post a couple of times, but my stories never seem to go anywhere. Let's find my stories. I'm going to filter by 'lloydt' (my Hacker News name) and take a look.
Clicking on any of the counts will lead to my stories. Clicking on the 4 in the 2013 row's count column shows my stories for that year.
My best story scored a 4. Looking at the front page of Hacker News at the moment, we see posts with a variety of scores, but the lowest looks to be 7.
If we use the score as a dimension (GROUP BY score, in SQL) and count the number of posts with each score, we get an idea of how likely a story is to achieve a given score. Looking at the table and graph below, we can see that many stories have a score in the 1-4 range, like mine.
The goal here is to figure out whether a story made it to the front page of Hacker News. Many stories are posted each day and most don't get there. The distribution is heavily bifurcated: some stories make it, but most don't.
Unfortunately, there is no obvious way to see the split in the data.
Sometimes, just picking a somewhat arbitrary threshold will help us find it. In this case, I'm going to pick 7 as a threshold for an interesting story. Later, we can investigate different thresholds, but for now, I'm going to say if a story has a score of 7 or more, it's interesting.
Let's build a new dimension in LookML. Note that we can reuse the score definition (${score}) from the main model to create score_7_plus.
dimension: score_7_plus {
type: yesno
sql: ${score} >= 7 ;;
}
Using the new score_7_plus dimension, we run a query and we can see that about 15% (300K/1959K) of the stories have a score of 7 or above.
Looker, using the LookML model, is writing all the SQL for us behind the scenes and sending it to BigQuery. The query it sent on our behalf was:
SELECT
CASE WHEN stories.score >= 7 THEN 'Yes' ELSE 'No' END AS stories_score_7_plus,
COUNT(*) AS stories_count
FROM [fh-bigquery:hackernews.stories]
AS stories
GROUP EACH BY 1
ORDER BY 2 DESC
LIMIT 500
Getting a front page story is no easy feat. Let's see if we can figure out whether someone out there does it consistently. We're going to use our lucky 7 as the threshold. To examine this, we keep our grouping by Score 7 Plus and additionally group by Author. Looker lets us pivot the results. We'll also sort by the Yes count column to find the person with the most posts scoring 7 or more.
It looks like an author named cwan has had the most posts that made it to the front page. To see what cwan posts about, we just click on his story count; all Looker counts drill into the detail behind them. Let's look at cwan's posts.
Filtering data to a smaller set can help us find trends about a particular subject.
We can go back to our original "top poster" query and restrict it to posts whose titles contain the word 'Facebook'. We'll see a different set of people.
Notice that the top poster, ssclafani, has had 122 posts with 'Facebook' in the title and 65 of them have a score of 7 or higher (about 50%).
iProject has had 323 posts about Facebook and only 29 have scored 7 or higher (about 10%).
Often, the devil is in the details. Many times, I've clicked into a number, looked at the underlying data records, and spotted a pattern. Let's look at ssclafani's Facebook posts and see if we can find something interesting. By clicking into the count, we can see the posts.
Pivoting the data is helpful, but we're still doing some calculations by hand. We can create a couple of custom measures that will help us understand the data more readily.
We'll create a count of just the posts that scored 7 and above.
Then we'll create a measure that is the percentage of all posts that scored 7 and above.
LookML makes creating these measures easy.
Notice that we reuse the definition of score_7_plus.
measure: count_score_7_plus {
type: count
drill_fields: [detail*]
filters: {
field: score_7_plus
value: "Yes"
}
}
And we reuse the definition of count_score_7_plus in the following definition.
measure: percent_7_plus {
type: number
sql: 100.0 * ${count_score_7_plus} / ${count} ;;
}
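Conceptually, this pair of measures is just a filtered count divided by a total count. As a sanity check on the arithmetic, here is a minimal Python sketch of the same computation (the score list is made up for illustration):

```python
def percent_7_plus(scores, threshold=7):
    """Percentage of stories whose score clears the threshold,
    mirroring count_score_7_plus / count * 100."""
    count_7_plus = sum(1 for s in scores if s >= threshold)
    return 100.0 * count_7_plus / len(scores)

print(percent_7_plus([1, 2, 9, 15, 3, 7, 4, 1]))  # 37.5
```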
With the new measures, we can rebuild and run our previous query. The percentage measure really helps us see that the author Slimy is quite good at placing stories: 65.22% of his stories score 7 or higher.
Another advantage of creating a new measure is we can now sort by it. Let's sort by Percent 7 Plus and look at people that have posted more than 5 stories (again, an arbitrary number).
Hacker News stories are just titles and URLs that point somewhere (plus comments). Let's take a look at where the stories that get posted live. To do this, we'll have to parse the host out of the URL. We'll build a dimension in LookML that does this; BigQuery's SQL has a regular expression extractor that makes it pretty easy. LookML also lets us write the HTML used to display the value.
We add the dimension to our model:
dimension: url_host {
sql: REGEXP_EXTRACT(${url},'http://([^/]+)/') ;;
}
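The same pattern is easy to sanity-check outside BigQuery. Here's a small Python sketch using the standard re module; like the dimension above, it only matches http:// URLs with a slash after the host (the example URLs are made up):

```python
import re

# Same pattern as the url_host dimension: capture everything between
# "http://" and the next "/".
HOST_RE = re.compile(r'http://([^/]+)/')

def url_host(url):
    """Return the host portion of a URL, or None when the pattern misses."""
    m = HOST_RE.search(url)
    return m.group(1) if m else None

print(url_host('http://www.techcrunch.com/2016/03/some-story/'))  # www.techcrunch.com
print(url_host('no url here'))                                    # None
```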
And now we can look at stories by the host they point to. Let's sort by Count Score 7 Plus.
And here's a peek at the SQL that Looker wrote for us:
SELECT
REGEXP_EXTRACT(stories.url,'http://([^/]+)/') AS stories_url_host,
COUNT(*) AS stories_count,
COUNT(CASE WHEN stories.score >= 7 THEN 1 ELSE NULL END) AS stories_count_score_7_plus,
100.0 * (COUNT(CASE WHEN stories.score >= 7 THEN 1 ELSE NULL END)) / (COUNT(*)) AS stories_percent_7_plus
FROM [fh-bigquery:hackernews.stories]
AS stories
GROUP EACH BY 1
ORDER BY 3 DESC
LIMIT 500
Domains are probably more interesting than hosts. After all, www.techcrunch.com and techcrunch.com both appear in this list. So let's build another field that parses the domain out of the host. We have to be careful with hosts like 'bbc.co.uk', so when a domain ends in a two-letter country code, we grab an extra segment.
dimension: url_domain {
sql: REGEXP_EXTRACT(${url_host},'([^\\.]+\\.[^\\.]+(?:\\.[a-zA-Z][a-zA-Z])?)$') ;;
}
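Again, the regex is easy to test locally. A Python sketch of the same extraction, with the two-letter country-code group written out explicitly (the hosts are hypothetical examples):

```python
import re

# Take the last two dot-separated labels of the host, plus a third
# when the suffix is a two-letter country code (e.g. bbc.co.uk).
DOMAIN_RE = re.compile(r'([^.]+\.[^.]+(?:\.[a-zA-Z][a-zA-Z])?)$')

def url_domain(host):
    """Return the registrable domain portion of a host, or None."""
    m = DOMAIN_RE.search(host)
    return m.group(1) if m else None

print(url_domain('www.techcrunch.com'))  # techcrunch.com
print(url_domain('www.bbc.co.uk'))       # bbc.co.uk
```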
Are there domains that are more successful than others? Let's look at domains by Percent 7 Plus.
Whoops, looks like a bunch of one-hit-wonders. Let's eliminate hosts that have fewer than 20 successful posts.
There's an old joke about a group of people who encounter a bear in the woods. They all start running from the bear. The punchline: you don't have to outrun the bear, you just have to outrun the other people.
Hacker News scores are like that. We probably don't care what the actual score is; we just care that it's better than the other scores posted on the same day.
We'll rank the scores for each day, with 1 as the best score of the day, counting up from there.
To compute the daily rank, we'll need a SQL window function and a derived table in LookML. The output is a two-column table with the id of the story and its rank on the day it was posted.
view: daily_rank {
derived_table: {
sql: SELECT
id
, RANK() OVER (PARTITION BY post_date ORDER BY score DESC) as daily_rank
FROM (
SELECT
id
, DATE(time_ts) as post_date
, score
FROM [fh-bigquery:hackernews.stories]
WHERE score > 0
)
;;
}
dimension: id {
primary_key: yes
hidden: yes
}
dimension: daily_rank {
type: number
}
}
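If window functions are unfamiliar, here is a small Python sketch of what RANK() OVER (PARTITION BY post_date ORDER BY score DESC) computes: stories are grouped by day, sorted by score descending, and ties share a rank with the usual RANK() gaps (the sample rows are made up):

```python
from collections import defaultdict

def daily_ranks(stories):
    """stories: list of (id, post_date, score) tuples -> {id: rank},
    where rank 1 is the best score on that day."""
    by_day = defaultdict(list)
    for sid, day, score in stories:
        by_day[day].append((sid, score))
    ranks = {}
    for day, rows in by_day.items():
        rows.sort(key=lambda r: -r[1])  # best score first
        prev_score, prev_rank = None, 0
        for i, (sid, score) in enumerate(rows, start=1):
            rank = prev_rank if score == prev_score else i  # ties share a rank
            ranks[sid] = rank
            prev_score, prev_rank = score, rank
    return ranks

stories = [(1, '2013-01-01', 50), (2, '2013-01-01', 50),
           (3, '2013-01-01', 7), (4, '2013-01-02', 3)]
print(daily_ranks(stories))  # {1: 1, 2: 1, 3: 3, 4: 1}
```

Note the RANK() behavior on ties: the two 50-point stories both get rank 1, and the next story gets rank 3, not 2.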
We can join this table into our stories explore.
explore: stories {
join: daily_rank {
sql_on: ${stories.id} = ${daily_rank.id} ;;
relationship: one_to_one
}
}
We can then group our data by daily_rank and count the stories at each rank. The data looks right: there are roughly 3,000 days in the dataset, and one story at each rank for each day.
The SQL that Looker wrote for this query is below. As the model gets more and more complex, so do the queries, but because Looker is doing the query-writing behind the scenes, asking the question remains simple.
SELECT
daily_rank.daily_rank AS daily_rank_daily_rank,
COUNT(*) AS stories_count
FROM [fh-bigquery:hackernews.stories]
AS stories
LEFT JOIN (SELECT
id
, RANK() OVER (PARTITION BY post_date ORDER BY score DESC) as daily_rank
FROM (
SELECT
id
, DATE(time_ts) as post_date
, score
FROM [fh-bigquery:hackernews.stories]
WHERE score > 0
)
) AS daily_rank ON stories.id = daily_rank.id
GROUP EACH BY 1
ORDER BY 1
LIMIT 500
As we did before, building dimensions and measures into the model lets us think in these terms.
We build them much as we built our Score 7 measures. Notice that we simply reference ${daily_rank.daily_rank}, and Looker figures out how to write the SQL to make it all fit together.
# Was this post in the top 25 on a given day?
dimension: rank_25_or_less {
type: yesno
sql: ${daily_rank.daily_rank} <= 25 ;;
}
# How many posts were in the top 25 out of this group of posts?
measure: count_rank_25_or_less {
type: count
drill_fields: [detail*]
filters: {
field: rank_25_or_less
value: "Yes"
}
}
# What percentage of posts were in the top 25 in this group of posts?
measure: percent_rank_25_or_less {
type: number
sql: 100.0 * ${count_rank_25_or_less} / ${count} ;;
}
And the simple output. Looks like about 4% of posts make it to the top 25 on a given day.
Now let's look at it by poster. Looks like Paul Graham (whose author name is "pg") has had lots of top 25 posts and a very high hit rate.
Rerunning the query by target domain, filtering to domains with high counts of rank-25-or-less stories, gives us a fascinating list. The obvious ones are there -- nytimes, bbc.co.uk -- but scrolling down a little, I find domains I don't know about. Following the links (we'll talk about how to make these later) usually takes me somewhere interesting.
We can now find top posts. Let's see if we can figure out why some posts rise to the top. Are top posts talking about something in particular? Let's look for common words in the titles.
First, we're going to build a derived table with two columns: the story id and a word that appeared in the title.
view: story_words {
derived_table: {
sql: SELECT id, SPLIT(title," ") as word
FROM [fh-bigquery:hackernews.stories] stories ;;
}
dimension: id {
primary_key: yes
hidden: yes
}
dimension: word {}
}
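The SPLIT-and-flatten step can be pictured in plain Python: every (id, title) row fans out into one row per word (the sample title is invented for illustration):

```python
def story_words(stories):
    """stories: list of (id, title) pairs.
    Returns (id, word) rows, like SPLIT(title, " ") flattened."""
    return [(sid, word)
            for sid, title in stories
            for word in title.split(' ')]

print(story_words([(1, 'Show HN: a BigQuery demo')]))
# [(1, 'Show'), (1, 'HN:'), (1, 'a'), (1, 'BigQuery'), (1, 'demo')]
```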
Next we'll build an explore definition (the join relationships). We'll reuse both our prior views (the story view and the daily_rank view). The basis of this explore is a word, not a story.
explore: story_words {
join: stories {
sql_on: ${story_words.id}=${stories.id} ;;
relationship: many_to_one
type: left_outer_each
}
join: daily_rank {
sql_on: ${stories.id} = ${daily_rank.id} ;;
relationship: one_to_one
type: left_outer_each
}
}
Now we can explore by word. Let's look at the words in the posts with a rank of 25 or less. Scroll down a little to look past some of the small common words.
and again, the SQL Looker is writing for us:
SELECT
story_words.word AS story_words_word,
COUNT(DISTINCT CASE WHEN daily_rank.daily_rank <= 25 THEN stories.id ELSE NULL END, 1000) AS stories_count_rank_25_or_less
FROM (SELECT id, SPLIT(title," ") as word
FROM [fh-bigquery:hackernews.stories] stories
) AS story_words
LEFT JOIN EACH [fh-bigquery:hackernews.stories]
AS stories ON story_words.id=stories.id
LEFT JOIN EACH (SELECT
id
, RANK() OVER (PARTITION BY post_date ORDER BY score DESC) as daily_rank
FROM (
SELECT
id
, DATE(time_ts) as post_date
, score
FROM [fh-bigquery:hackernews.stories]
WHERE score > 0
)
) AS daily_rank ON stories.id = daily_rank.id
GROUP EACH BY 1
ORDER BY 2 DESC
LIMIT 500
Of course clicking on any of the numbers will drill in and show us the stories.
The common words are a problem. It would be great to eliminate or at least flag them.
To do this, we're going to use an inspired little hack.
BigQuery provides a nice little table of all the words in Shakespeare. The table consists of the word, the corpus it appeared in, and the year the corpus was written.
We're going to find the 1,000 most common of these words, and then flag any title word that appears in that list.
First, we write a little query to find the 1,000 most common words in Shakespeare.
SELECT
lower(word) as ssword
, count(distinct corpus) as c
FROM [publicdata:samples.shakespeare]
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1000
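In Python terms, the query counts the number of distinct corpora each lowercased word appears in, then keeps the most widespread words. A sketch with a few invented (word, corpus) rows:

```python
def most_common_words(rows, n):
    """rows: list of (word, corpus) pairs.
    Returns the n words appearing in the most distinct corpora."""
    corpora = {}
    for word, corpus in rows:
        corpora.setdefault(word.lower(), set()).add(corpus)
    ranked = sorted(corpora, key=lambda w: -len(corpora[w]))
    return ranked[:n]

rows = [('the', 'hamlet'), ('The', 'macbeth'), ('dagger', 'macbeth'),
        ('of', 'hamlet'), ('of', 'macbeth'), ('of', 'othello')]
print(most_common_words(rows, 2))  # ['of', 'the']
```

Lowercasing first means 'The' and 'the' count as one word, just as LOWER(word) does in the query.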
With this word list, we can modify our derived table to add a new column, ssword, which, if NOT NULL, means the word appears in Shakespeare (and we'll consider it common).
view: story_words {
derived_table: {
sql: SELECT a.id as id, a.word as word, b.ssword as ssword
FROM FLATTEN((
SELECT id, LOWER(SPLIT(title," ")) as word
FROM [fh-bigquery:hackernews.stories] stories
), word) as a
LEFT JOIN (
SELECT lower(word) as ssword
, count(distinct corpus) as c
FROM [publicdata:samples.shakespeare]
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1000) as b
ON a.word = b.ssword
;;
}
dimension: id {
primary_key: yes
hidden: yes
}
dimension: word {}
dimension: is_common_word {
type: yesno
sql: ${TABLE}.ssword IS NOT NULL ;;
}
}
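The effect of the LEFT JOIN here is just set membership: every word row is kept, with a flag saying whether the word appears in the common-word list. Sketched in Python (sample rows invented; the words arrive already lowercased, as in the derived table):

```python
def flag_common(word_rows, common_words):
    """word_rows: list of (id, word); common_words: iterable of common words.
    Returns (id, word, is_common_word) rows, mirroring the LEFT JOIN."""
    common = {w.lower() for w in common_words}
    return [(sid, w, w.lower() in common) for sid, w in word_rows]

rows = [(1, 'the'), (1, 'rise'), (1, 'of'), (1, 'bigquery')]
print(flag_common(rows, {'the', 'of', 'and'}))
# [(1, 'the', True), (1, 'rise', False), (1, 'of', True), (1, 'bigquery', False)]
```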
Now, rerunning our query with the common-word field, we can see that we've isolated some of the more common words.
And now without the common words:
Now with a few clicks we can start comparing by filtering words to Microsoft, Google and Facebook. Let's compare front page posts by year.
The next step is to make a data discovery application and cross-wire all the research we've done so far. We can easily build a dashboard that shows posts over time, by domain, by author, and by word, along with success rates in making it to a score of 7, and from a score of 7 into the daily top 25.
We wire up filters for author, domain and word, so that any of these will change all the data on the dashboard.
For example, let's look at Paul Graham (author name "pg"). He is posting a little less over time, likes to post about ycombinator, talks about yc, applications, hn and startups. His posts look very successful.
One of the nice things we can do in Looker is create links when we render result cells. Dimensions have an html: parameter that is rendered with Liquid templating.
Using this mechanism, we can cross link everywhere we display author, domain and word to point to a dashboard.
For example, we link author to both the dashboard and the profile page on Hacker News. We use emoji characters as the link icons.
dimension: author {
type: string
sql: ${TABLE}.author ;;
html:
<a href="/dashboards/169?author={{ value }}"
title="Goto Dashboard"
target=new>⚡</a>
<a href="https://news.ycombinator.com/user?id={{ value }}"
title="Goto news.ycombinator.com"
target=new>➚</a>;;
}
There is lots more to investigate.