Fun with data - hacking Hacker News

This content, written by Lloyd Tabb, was initially posted on the Looker Blog on Mar 29, 2016. The content is subject to limited support.

Hacking Hacker News

Last year, Google uploaded complete data about Hacker News posts to Google's BigQuery data engine. BigQuery is a giant clustered SQL engine in the cloud that can query enormous amounts of data very quickly. Felipe did some great analysis of the dataset, writing SQL by hand and graphing the results in Matlab.

Others on Hacker News picked up the thread and added their own analyses on Hacker News (meta, right?). I loved the idea of exploring this public dataset and wanted to surface some new insights, and also to make it accessible to anyone who wanted to play with the data, whether or not they knew SQL.

So I wrote up a quick model in LookML (Looker's modeling language) to describe how the BigQuery tables relate to each other. (It's all of four files and fewer than 300 lines of code.)

Below, I'll walk through the process of building out the model and the analyses it makes easy (want to know which domain has the most stories on Hacker News? Or which words are most common? Or which author's stories score the highest? Read on!). But if you're in a hurry, jump right to a dashboard or an explore.

Start with the raw data

Navigating to the dataset in BigQuery, we see there are two tables, stories and comments. Both tables are relatively simple in structure.

Table stories

Each story contains a story id, the author that made the post, when it was written, and the score the story achieved, which I believe corresponds to 'points' on Hacker News. Each story also has a title and the URL containing the content.

Getting started

Let's start with stories. First, we run Looker's generator to create a LookML model for stories. Each field in the table will have an associated LookML dimension. These dimensions are used both in Looker's Explorer and to write SQL statements to run in BigQuery.

Without changing any of this code, we can explore the data immediately.

explore: stories {}

view: stories {
  sql_table_name: [fh-bigquery:hackernews.stories] ;;

  measure: count {
    type: count
    drill_fields: [detail*]
  }

  dimension: id {
    type: number
    sql: ${TABLE}.id ;;
  }

  dimension: by {
    type: string
    sql: ${TABLE}.[by] ;;
  }

  dimension: score {
    type: number
    sql: ${TABLE}.score ;;
  }

  dimension: time {
    type: number
    sql: ${TABLE}.[time] ;;
  }

  dimension_group: post {
    type: time
    timeframes: [time, date, week, month, year]
    sql: ${TABLE}.time_ts ;;
  }

  dimension: title {
    type: string
    sql: ${TABLE}.title ;;
  }

  dimension: url {
    type: string
    sql: ${TABLE}.url ;;
  }

  dimension: text {
    type: string
    sql: ${TABLE}.text ;;
  }

  dimension: deleted {
    type: yesno
    sql: ${TABLE}.deleted ;;
  }

  dimension: dead {
    type: yesno
    sql: ${TABLE}.dead ;;
  }

  dimension: descendants {
    type: number
    sql: ${TABLE}.descendants ;;
  }

  dimension: author {
    type: string
    sql: ${TABLE}.author ;;
  }

  set: detail {
    fields: [id, post_time, author, title]
  }  
}

Let's start exploring

The first and most obvious question is, 'How many stories are there?'

Is the number of stories going up or down? Let's look by year.

Looks like stories peaked in 2013.
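Under the hood, Looker writes the SQL for every one of these questions. A sketch of what the by-year query looks like in BigQuery's legacy SQL (the column aliases are illustrative, not necessarily Looker's exact output):

SELECT
  YEAR(stories.time_ts) AS stories_post_year,
  COUNT(*) AS stories_count
FROM [fh-bigquery:hackernews.stories] AS stories
GROUP EACH BY 1
ORDER BY 1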

How do I get my story to the top of Hacker News?

I've tried to post a couple of times, but my stories never seem to go anywhere. Let's find my stories. I'm going to filter by 'lloydt' (my Hacker News username) and take a look.
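The author filter simply becomes a WHERE clause. A hand-written sketch of the query:

SELECT
  YEAR(stories.time_ts) AS stories_post_year,
  COUNT(*) AS stories_count
FROM [fh-bigquery:hackernews.stories] AS stories
WHERE stories.author = 'lloydt'
GROUP EACH BY 1
ORDER BY 1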

Clicking on any of the counts will lead to my stories. Clicking on the 4 in the 2013 row's count column shows my stories for that year.

My best story scored a 4. Looking at the front page of Hacker News at the moment, we see posts with a variety of scores, but the lowest looks to be 7.

How are the scores distributed?

If we use the score as a dimension (GROUP BY score, in SQL) and count the number of posts with each score, we get an idea of how likely a story is to achieve a given score. Looking at the table and graph below, we can see that many stories have a score in the 1-4 range, like mine.
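The grouped query is about as simple as SQL gets; a sketch:

SELECT
  stories.score AS stories_score,
  COUNT(*) AS stories_count
FROM [fh-bigquery:hackernews.stories] AS stories
GROUP EACH BY 1
ORDER BY 1
LIMIT 500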

Lucky 7

The goal here is to figure out whether a story made it to the front page of Hacker News. Many stories are posted each day, and most don't get there. It is pretty obvious that the distribution is heavily bifurcated: some stories make it, but most don't.

Unfortunately, there is no obvious way to see the split in the data.

Sometimes, just picking a somewhat arbitrary threshold will help us find it. In this case, I'm going to pick 7 as a threshold for an interesting story. Later, we can investigate different thresholds, but for now, I'm going to say if a story has a score of 7 or more, it's interesting.

Let's build a new dimension in LookML. Note that we can reuse the score definition (${score}) from the main model to create score_7_plus.

dimension: score_7_plus {
  type: yesno
  sql: ${score} >= 7 ;;
}

Using the new score_7_plus dimension, we run a query and can see that about 15% (300K/1,959K) of the stories have a score of 7 or above.

Looker, using the LookML model, is writing all the SQL for us behind the scenes and sending it to BigQuery. The query it sent on our behalf was:

SELECT
  CASE WHEN stories.score >= 7 THEN 'Yes' ELSE 'No' END AS stories_score_7_plus,
  COUNT(*) AS stories_count
FROM [fh-bigquery:hackernews.stories]
 AS stories
GROUP EACH BY 1
ORDER BY 2 DESC
LIMIT 500

Who is the king of Hacker News?

Getting a front page story is no easy feat. Let's see if we can figure out if someone out there does it consistently. We're going to use our lucky 7 as the threshold. To examine this, we keep our grouping by Score 7 Plus and additionally group by Author. Looker lets us pivot the results. We're also going to sort by the Yes Count column to find the person with the most posts with a score of 7 or more.
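Looker does the pivot in the result table; the underlying query just groups by both fields. A sketch:

SELECT
  stories.author AS stories_author,
  CASE WHEN stories.score >= 7 THEN 'Yes' ELSE 'No' END AS stories_score_7_plus,
  COUNT(*) AS stories_count
FROM [fh-bigquery:hackernews.stories] AS stories
GROUP EACH BY 1, 2
ORDER BY 3 DESC
LIMIT 500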

It looks like an author named cwan has had the most posts that made it to the front page. To see what cwan posts about, we just click on the story count; all Looker counts drill into the detail behind them. Let's look at cwan's posts.

Finding top posters about a particular subject

Filtering data to a smaller set can help us find trends about a particular subject.

We can go back to our original "top poster" query and look at whose posts contain the word 'Facebook'. We'll see a different set of people.

Notice that the top poster, ssclafani, has had 122 posts with 'Facebook' in the title, and 65 of them have a score of 7 or higher (about 50%).

iProject has had 323 posts about Facebook, and only 29 have scored 7 or higher (about 10%).
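A sketch of the filtered query, using legacy BigQuery's CONTAINS operator for the title filter:

SELECT
  stories.author AS stories_author,
  COUNT(*) AS stories_count,
  COUNT(CASE WHEN stories.score >= 7 THEN 1 ELSE NULL END) AS stories_count_score_7_plus
FROM [fh-bigquery:hackernews.stories] AS stories
WHERE stories.title CONTAINS 'Facebook'
GROUP EACH BY 1
ORDER BY 3 DESC
LIMIT 500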

Often, the devil is in the details. Many times, I've clicked into a number, looked at the underlying data records, and seen some pattern. Let's look at ssclafani's Facebook posts and see if we can find something interesting. By clicking into the count, we can see the posts.

Generalizing hit rate

Pivoting the data is helpful, but we're still doing some calculations by hand. We can create a couple of custom measures that will help us understand the data more readily.

We'll create a count of just the posts that scored 7 and above.

Then we'll create a measure that is the percentage of all posts that scored 7 and above.

LookML makes creating these measures easy.

Notice that we reuse the definition of score_7_plus.

measure: count_score_7_plus {
  type: count
  drill_fields: [detail*]

  filters: {
    field: score_7_plus
    value: "Yes"
  }
}

And we reuse the definition of count_score_7_plus in the following definition.

measure: percent_7_plus {
  type: number
  sql: 100.0 * ${count_score_7_plus} / ${count} ;;
}

With the new measures, we can rebuild and run our previous query. The percentage measure really helps us see that the author slimy is quite good at placing stories; 65.22% of his stories score 7 or higher.

Are there people better than the author called 'slimy'?

Another advantage of creating a new measure is that we can now sort by it. Let's sort by Percent 7 Plus and look at people who have posted more than 5 stories (again, an arbitrary threshold).
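A sketch of that query; the "more than 5 stories" filter becomes a HAVING clause (legacy BigQuery lets HAVING reference output aliases):

SELECT
  stories.author AS stories_author,
  COUNT(*) AS stories_count,
  100.0 * (COUNT(CASE WHEN stories.score >= 7 THEN 1 ELSE NULL END)) / (COUNT(*)) AS stories_percent_7_plus
FROM [fh-bigquery:hackernews.stories] AS stories
GROUP EACH BY 1
HAVING stories_count > 5
ORDER BY 3 DESC
LIMIT 500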

Where do the stories live?

Hacker News itself only stores titles and URLs that point elsewhere (plus comments). Let's take a look at where the stories that get posted actually live. To do this, we'll have to parse the host out of the URL. We'll build a dimension in LookML that does this; BigQuery's SQL has a regular expression extractor that makes it pretty easy. LookML also lets us write the HTML for how a value is displayed (more on that later).

We add the dimension to our model:

dimension: url_host {
  sql: REGEXP_EXTRACT(${url},'http://([^/]+)/') ;;
}

And now we can look at stories by the host they point to. Let's sort by Count Score 7 Plus.

And a peek at the SQL that Looker wrote for us:

SELECT
  REGEXP_EXTRACT(stories.url,'http://([^/]+)/') AS stories_url_host,
  COUNT(*) AS stories_count,
  COUNT(CASE WHEN stories.score >= 7 THEN 1 ELSE NULL END) AS stories_count_score_7_plus,
  100.0 * (COUNT(CASE WHEN stories.score >= 7 THEN 1 ELSE NULL END)) / (COUNT(*)) AS stories_percent_7_plus
FROM [fh-bigquery:hackernews.stories]
 AS stories
GROUP EACH BY 1
ORDER BY 3 DESC
LIMIT 500

Domains are better

Domains are probably more interesting than hosts. After all, www.techcrunch.com and techcrunch.com both appear in this list. So let's build another field that parses the domain out of the host. We have to be careful to deal with hosts like 'bbc.co.uk', so when a domain ends in two letters, we grab one more label.

dimension: url_domain {
  sql: REGEXP_EXTRACT(${url_host},'([^\\.]+\\.[^\\.]+(?:\\.[a-zA-Z]{2})?)$') ;;
}

Are there domains that are more successful than others? Let's look at domains by Percent 7 Plus.

Whoops, looks like a bunch of one-hit wonders. Let's eliminate domains that have fewer than 20 successful posts.
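A hand-written sketch of that query (not Looker's exact output; the domain regex here uses bracketed dots to sidestep string-escaping differences):

SELECT
  REGEXP_EXTRACT(REGEXP_EXTRACT(stories.url,'http://([^/]+)/'),
    '([^.]+[.][^.]+(?:[.][a-zA-Z]{2})?)$') AS stories_url_domain,
  COUNT(CASE WHEN stories.score >= 7 THEN 1 ELSE NULL END) AS stories_count_score_7_plus,
  100.0 * (COUNT(CASE WHEN stories.score >= 7 THEN 1 ELSE NULL END)) / (COUNT(*)) AS stories_percent_7_plus
FROM [fh-bigquery:hackernews.stories] AS stories
GROUP EACH BY 1
HAVING stories_count_score_7_plus >= 20
ORDER BY 3 DESC
LIMIT 500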

Building a better indication that a post was on the front page

There is an old joke about a group of people who encounter a bear in the woods. They all start running from the bear. The punchline: you don't have to outrun the bear, you just have to outrun the other people.

Hacker News scores are like that. We probably don't care what the actual score is, just that it's better than the other scores posted on the same day.

We'll rank the score for each day, with 1 as the best score of the day, counting up from there.

To compute the daily rank, we'll need to use a SQL window function inside a derived table in LookML. The output is a two-column table with the id of the story and the rank of the story on the day it was posted.

view: daily_rank {
  derived_table: {
    sql: SELECT
         id
        , RANK() OVER (PARTITION BY post_date ORDER BY score DESC) as daily_rank
      FROM (
         SELECT
          id
          , DATE(time_ts) as post_date
          , score
         FROM [fh-bigquery:hackernews.stories]
         WHERE score > 0
      )
       ;;
  }

  dimension: id {
    primary_key: yes
    hidden: yes
  }

  dimension: daily_rank {
    type: number
  }
}

We can join this table into our stories explore.

explore: stories {
  join: daily_rank {
    sql_on: ${stories.id} = ${daily_rank.id} ;;
    relationship: one_to_one
  }
}

We can then look at our data by daily_rank and count the stories at each rank. The data looks right: there are some 3,000 days in the dataset, and a story for each rank on each day.

The SQL that Looker wrote for this query is below. As the model gets more and more complex, so do the queries, but because Looker is doing the query-writing behind the scenes, asking the question remains simple.

SELECT
  daily_rank.daily_rank AS daily_rank_daily_rank,
  COUNT(*) AS stories_count
FROM [fh-bigquery:hackernews.stories]
 AS stories
LEFT JOIN (SELECT
   id
  , RANK() OVER (PARTITION BY post_date ORDER BY score DESC) as daily_rank
FROM (
   SELECT
    id
    , DATE(time_ts) as post_date
    , score
   FROM [fh-bigquery:hackernews.stories]
   WHERE score > 0
)
) AS daily_rank ON stories.id = daily_rank.id
GROUP EACH BY 1
ORDER BY 1
LIMIT 500

Let's build a new set of "top 25" dimensions and measures

Like we did before, building dimensions and measures into the model will allow us to think in these terms.

We build them in much the same way we built our Score 7 measures. Notice we simply reference ${daily_rank.daily_rank} and Looker figures out how to write the SQL to make it all fit together.

  # Was this post in the top 25 on a given day?
  dimension: rank_25_or_less {
    type: yesno
    sql: ${daily_rank.daily_rank} <= 25 ;;
  }

  # How many posts were in the top 25 out of this group of posts?

  measure: count_rank_25_or_less {
    type: count
    drill_fields: [detail*]
    filters: {
      field: rank_25_or_less
      value: "Yes"
    }
  } 

  # What percentage of posts were in the top 25 in this group of posts?
  measure: percent_rank_25_or_less {
    type: number
    sql: 100.0 * ${count_rank_25_or_less} / ${count} ;;
  }

And the simple output. Looks like about 4% of posts make it to the top 25 on a given day.
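The generated SQL folds the daily_rank derived table in just as before; a sketch:

SELECT
  CASE WHEN daily_rank.daily_rank <= 25 THEN 'Yes' ELSE 'No' END AS stories_rank_25_or_less,
  COUNT(*) AS stories_count
FROM [fh-bigquery:hackernews.stories] AS stories
LEFT JOIN EACH (
  SELECT id
    , RANK() OVER (PARTITION BY post_date ORDER BY score DESC) as daily_rank
  FROM (
    SELECT id, DATE(time_ts) as post_date, score
    FROM [fh-bigquery:hackernews.stories]
    WHERE score > 0
  )
) AS daily_rank ON stories.id = daily_rank.id
GROUP EACH BY 1
ORDER BY 2 DESC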

Now let's look at it by poster. Looks like Paul Graham (whose author name is "pg") has had lots of top 25 posts and a very high hit rate.

Wow, looking by domain gives an amazing list

Rerunning the query, this time by target domain (keeping only domains with high counts of stories ranked 25 or less), gives us a fascinating list of domains. The obvious ones are there -- nytimes, bbc.co.uk -- but scrolling down a little, I find domains I don't know. Following the links (we'll talk about how to make these later) usually takes me somewhere interesting.

Common words in top posts

We can now find top posts. Let's see if we can figure out why some posts rise to the top. Are top posts talking about something in particular? Let's look for common words in post titles.

First, we're going to build a derived table that has two columns: the story id and a word that appeared in the title.

view: story_words {
  derived_table: {
    sql: SELECT id, SPLIT(title," ") as word
      FROM [fh-bigquery:hackernews.stories] stories ;;
  }

  dimension: id {
    primary_key: yes
    hidden: yes
  }

  dimension: word {}
}

Next we'll build an explore definition (the join relationships). We'll reuse both our prior views (the story view and the daily_rank view). The basis of this explore is a word, not a story.

explore: story_words {
  join: stories {
    sql_on: ${story_words.id}=${stories.id} ;;
    relationship: many_to_one
    type: left_outer_each
  }

  join: daily_rank {
    sql_on: ${stories.id} = ${daily_rank.id} ;;
    relationship: one_to_one
    type: left_outer_each
  }
}

Now we can explore by word. Let's look at the words in posts with a rank of 25 or less. Scroll down a little to look past some of the small common words.

And again, the SQL Looker is writing for us:

SELECT
  story_words.word AS story_words_word,
  COUNT(DISTINCT CASE WHEN daily_rank.daily_rank <= 25 THEN stories.id ELSE NULL END, 1000) AS stories_count_rank_25_or_less
FROM (SELECT id, SPLIT(title," ") as word
FROM [fh-bigquery:hackernews.stories] stories
) AS story_words
LEFT JOIN EACH [fh-bigquery:hackernews.stories]
 AS stories ON story_words.id=stories.id
LEFT JOIN EACH (SELECT
   id
  , RANK() OVER (PARTITION BY post_date ORDER BY score DESC) as daily_rank
FROM (
   SELECT
    id
    , DATE(time_ts) as post_date
    , score
   FROM [fh-bigquery:hackernews.stories]
   WHERE score > 0
)
) AS daily_rank ON stories.id = daily_rank.id
GROUP EACH BY 1
ORDER BY 2 DESC
LIMIT 500

Of course clicking on any of the numbers will drill in and show us the stories.

Eliminating the common words with a little Shakespeare

The common words are a problem. It would be great to eliminate or at least flag them.

To do this, we're going to use an inspired little hack.

BigQuery provides a nice little table of all the words in Shakespeare. The table consists of the word, the corpus it appeared in, and the year the corpus was written.

We are going to find the 1,000 most common of these words and then flag any word we encounter that appears in that list.

First, we write a little query to find the 1000 most common words in Shakespeare.

SELECT
    lower(word) as ssword
    , count(distinct corpus) as c
  FROM [publicdata:samples.shakespeare]
  GROUP BY 1
  ORDER BY 2 DESC
  LIMIT 1000

With this word list, we can modify our derived table to add a new column, ssword, which, if NOT NULL, means the word appears in Shakespeare (and we consider it common).

view: story_words {
  derived_table: {
    sql: SELECT a.id as id, a.word as word, b.ssword as ssword
      FROM FLATTEN((
        SELECT id, LOWER(SPLIT(title," ")) as word
           FROM [fh-bigquery:hackernews.stories] stories
        ), word) as a
      LEFT JOIN (
         SELECT lower(word) as ssword
         , count(distinct corpus) as c
         FROM [publicdata:samples.shakespeare]
         GROUP BY 1
         ORDER BY 2
         DESC
         LIMIT 1000) as b
       ON a.word = b.ssword
       ;;
  }

  dimension: id {
    primary_key: yes
    hidden: yes
  }

  dimension: word {}

  dimension: is_common_word {
    type: yesno
    sql: ${TABLE}.ssword IS NOT NULL ;;
  }
}

Now, rerunning our query with the common-word field, we can see that we've isolated some of the more common words.

And now without the common words:
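The whole thing can also be written as one standalone query; a sketch that drops the common words entirely rather than just flagging them:

SELECT
  a.word AS word,
  COUNT(*) AS c
FROM FLATTEN((
  SELECT id, LOWER(SPLIT(title," ")) as word
  FROM [fh-bigquery:hackernews.stories]
), word) AS a
LEFT JOIN (
  SELECT lower(word) as ssword, count(distinct corpus) as c
  FROM [publicdata:samples.shakespeare]
  GROUP BY 1
  ORDER BY 2 DESC
  LIMIT 1000
) AS b ON a.word = b.ssword
WHERE b.ssword IS NULL  -- keep only words outside Shakespeare's top 1,000
GROUP EACH BY 1
ORDER BY 2 DESC
LIMIT 100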

Finally ... which words, if in the title of a story, are most likely to get you on the front page?

Comparing

Now, with a few clicks, we can start comparing by filtering words to Microsoft, Google, and Facebook. Let's compare front page posts by year.
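A hand-written sketch of the comparison query. It lowercases words (as our final story_words table does), so the filter values are lowercase too:

SELECT
  story_words.word AS word,
  YEAR(stories.time_ts) AS post_year,
  COUNT(DISTINCT CASE WHEN daily_rank.daily_rank <= 25 THEN stories.id ELSE NULL END, 1000) AS count_rank_25_or_less
FROM FLATTEN((
  SELECT id, LOWER(SPLIT(title," ")) as word
  FROM [fh-bigquery:hackernews.stories]
), word) AS story_words
LEFT JOIN EACH [fh-bigquery:hackernews.stories] AS stories
  ON story_words.id = stories.id
LEFT JOIN EACH (
  SELECT id
    , RANK() OVER (PARTITION BY post_date ORDER BY score DESC) as daily_rank
  FROM (
    SELECT id, DATE(time_ts) as post_date, score
    FROM [fh-bigquery:hackernews.stories]
    WHERE score > 0
  )
) AS daily_rank ON stories.id = daily_rank.id
WHERE story_words.word IN ('microsoft','google','facebook')
GROUP EACH BY 1, 2
ORDER BY 1, 2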

Wiring this into an application

The next step is to make a data discovery application and cross-wire all the research we've done so far. We can easily build a dashboard that shows posts over time, by domain, by author, and by word, along with success rates of making it to a score of 7, and from a score of 7 into the top 25.

We wire up filters for author, domain, and word, so that changing any of them changes all the data on the dashboard.

For example, let's look at Paul Graham (author name "pg"). He is posting a little less over time, and likes to post about ycombinator, yc, applications, hn, and startups. His posts look very successful.

One of the nice things we can do in Looker is create links when we render the result cells. Dimensions have an html: parameter that is rendered with Liquid templating.

Using this mechanism, we can cross-link everywhere we display author, domain, and word so they point to a dashboard.

For example, we link author to both the dashboard and the profile page on Hacker News. We use emojis as the link markers.

dimension: author {
  type: string
  sql: ${TABLE}.author ;;
  html: {{ value }}
    <a href="/dashboards/169?author={{ value }}"
      title="Goto Dashboard"
      target=new>⚡</a>
    <a href="https://news.ycombinator.com/user?id={{ value }}"
      title="Goto news.ycombinator.com"
      target=new>➚</a> ;;
}

Other ideas to research

There is lots more to investigate.

  • What's really in a score? Is it related to the number of comments? (See the sketch after this list.)
  • If it is, are there any cliffs in comments that might be a better indicator of "frontpageness"?
  • If there are multiple people that post the same URL with almost the exact same text, does timing matter?
  • What time of day is the best time to post? Does it matter?
  • Does velocity of comments matter in score? How soon after the post do comments need to happen before a score goes up?
  • Do some commenters matter more than others? (For example, if you get a comment from someone who comments a lot, does that help your score more?)
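As a starting point for the first question, here's a hand-written sketch (not from the model) that compares score with average comment count, using the descendants column as the comment count:

SELECT
  stories.score AS score,
  COUNT(*) AS stories_count,
  AVG(stories.descendants) AS avg_comments  -- descendants = total comments under a story
FROM [fh-bigquery:hackernews.stories] AS stories
WHERE stories.score IS NOT NULL
GROUP EACH BY 1
ORDER BY 1
LIMIT 500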