Finding closest matches by name in BigQuery

We’ve all been there before: someone gives you a dataset without ID’s, keyed only by name. You either file it away and never look at it again, or spend the next two days matching up the records by hand.

Not to condone this situation, but it happens, so here’s a query (in BigQuery using a Javascript UDF) to help speed this up. I tried simply using Levenshtein distance, but found too many mismatches, so switched to a strategy of counting matching substrings of a given length. Even with this approach, there will be mismatches, so use this to speed up a manual review, rather than simply making it a PDT.

Note that you can customize:

  • stopWords, which get filtered out entirely. I chose this specific list when trying to match company names

  • tokens, which get shortened to limit their impact on the match score

  • grams, the number of consecutive characters that are considered when calculating the score. Maybe set it down to 2 or 3 if your dataset is composed of phrases with shorter words

      CREATE TEMPORARY FUNCTION match(str1 STRING, str2 STRING)
      RETURNS FLOAT64
      LANGUAGE js AS """
        var stopWords = /\\b(The|inc|ASP|Co|Ltd|LLC|SA|SAS|BV|Corp|formerly|previously|by|as|dba|com)\\b/ig
        var stopChars = /[-_ \\.,'"'\\/()]/g
        var tokens = /\\b(Global|Media Group|Group|Limited|Solutions|Software|Computing|Networks?|Systems|Health|Brands|Project|Services|Technologies|Technology|Companies|Holdings?|Studios?|Digital|Mobile|Financial|Management|Media|Maintenance|Agency|Support|Bank|reseller|account|internal|external|International|Dental|University|Corporation|Labs|Enterprises?|GmbH|Industries|Incorporated|Entertainment|Society|Security|Interactive|Capital)\\b/ig
        var grams = 4
        var tokenValue = 2
        var substrs1 = extract(str1)
        var substrs2 = extract(str2)
        var maxL = max(substrs1.length, substrs2.length)
        return ( parseInt(100*(
            substrs1.filter(s=>~substrs2.indexOf(s)).length
            +substrs2.filter(s=>~substrs1.indexOf(s)).length
          ) / (0.01+maxL) / 2))
        function max(a,b){return a>b?a:b}
        function extract(str){
          return str.toLowerCase()
            .replace(stopWords,"")
            .replace(tokens, m=>m.slice(0,tokenValue).toUpperCase())
            .replace(stopChars,"")
            .split('').map((char,c,arr)=>arr.slice(max(0,1+c-grams),1+c).join('')).filter(s=>s.length>=3)
        }
      """;
      WITH closest_by_name as (
      	SELECT 
      	  dataset.name, 
      	  MAX(match(dataset.name,official.name)) as similarity,
      	  CASE MAX(match(dataset.name,official.name))
      	  WHEN 0 THEN NULL
      	  ELSE SUBSTR(MAX(
      		CONCAT(
      		  CAST(ROUND(1000+match(dataset.name,official.name)) AS string)
                        ,CAST(1000000 + official.tie_breaker as string)
      		  ,official.name
      		)
      	  ),5+7) END as closest_name,
      	  CASE MAX(match(dataset.name,official.name))
      	  WHEN 0 THEN NULL
      	  ELSE SUBSTR(MAX(
      		CONCAT(
      		  CAST(ROUND(1000+match(dataset.name,official.name)) AS string)
                        ,CAST(1000000 + official.tie_breaker as string)
      		  ,official.id
      		)
      	  ),5+7) END as corresponding_id
      	FROM (
      	  SELECT DISTINCT name FROM `table1`
      	) AS dataset 
      	CROSS JOIN (
      	  SELECT id, name, ROW_NUMBER() OVER () as tie_breaker
      	  FROM `table2`
      	  --WHERE optional filter
      	) as official
      	GROUP BY 1
      )
      SELECT main.name, main.similarity, main.closest_name, main.id_of_closest_name, 
      STRING_AGG( CASE WHEN dupe.similarity > main.similarity THEN dupe.name END) as id_is_better_matched_to,
      STRING_AGG( CASE WHEN dupe.similarity < main.similarity THEN dupe.name END) as id_is_also_closest_to,
      CASE WHEN STRING_AGG( CASE WHEN dupe.similarity > main.similarity THEN dupe.name END) IS NULL THEN main.id_of_closest_name END as suggested_id
      FROM closest_by_name as main
      LEFT JOIN closest_by_name as dupe
       ON  dupe.id_of_closest_name = main.id_of_closest_name
       AND dupe.name <> main.name
      GROUP BY 1,2,3,4
      ORDER BY 2 ASC
2 4 2,906
4 REPLIES 4
Top Labels in this Space
Top Solution Authors