Finding closest matches by name in BigQuery - Page 2

fabio1 · 06-15-2018 08:57 AM

We’ve all been there before: someone gives you a dataset without ID’s, keyed only by name. You either file it away and never look at it again, or spend the next two days matching up the records by hand.

Not to condone this situation, but it happens, so here’s a query (in BigQuery using a Javascript UDF) to help speed this up. I tried simply using Levenshtein distance, but found too many mismatches, so switched to a strategy of counting matching substrings of a given length. Even with this approach, there will be mismatches, so use this to speed up a manual review, rather than simply making it a PDT.

Note that you can customize:

stopWords, which get filtered out entirely. I chose this specific list when trying to match company names
tokens, which get shortened to limit their impact on the match score

grams, the number of consecutive characters that are considered when calculating the score. Maybe set it down to 2 or 3 if your dataset is composed of phrases with shorter words

  CREATE TEMPORARY FUNCTION match(str1 STRING, str2 STRING)
  RETURNS FLOAT64
  LANGUAGE js AS """
    var stopWords = /\\b(The|inc|ASP|Co|Ltd|LLC|SA|SAS|BV|Corp|formerly|previously|by|as|dba|com)\\b/ig
    var stopChars = /[-_ \\.,'"'\\/()]/g
    var tokens = /\\b(Global|Media Group|Group|Limited|Solutions|Software|Computing|Networks?|Systems|Health|Brands|Project|Services|Technologies|Technology|Companies|Holdings?|Studios?|Digital|Mobile|Financial|Management|Media|Maintenance|Agency|Support|Bank|reseller|account|internal|external|International|Dental|University|Corporation|Labs|Enterprises?|GmbH|Industries|Incorporated|Entertainment|Society|Security|Interactive|Capital)\\b/ig
    var grams = 4
    var tokenValue = 2
    var substrs1 = extract(str1)
    var substrs2 = extract(str2)
    var maxL = max(substrs1.length, substrs2.length)
    return ( parseInt(100*(
        substrs1.filter(s=>~substrs2.indexOf(s)).length
        +substrs2.filter(s=>~substrs1.indexOf(s)).length
      ) / (0.01+maxL) / 2))
    function max(a,b){return a>b?a:b}
    function extract(str){
      return str.toLowerCase()
        .replace(stopWords,"")
        .replace(tokens, m=>m.slice(0,tokenValue).toUpperCase())
        .replace(stopChars,"")
        .split('').map((char,c,arr)=>arr.slice(max(0,1+c-grams),1+c).join('')).filter(s=>s.length>=3)
    }
  """;
  WITH closest_by_name as (
  	SELECT 
  	  dataset.name, 
  	  MAX(match(dataset.name,official.name)) as similarity,
  	  CASE MAX(match(dataset.name,official.name))
  	  WHEN 0 THEN NULL
  	  ELSE SUBSTR(MAX(
  		CONCAT(
  		  CAST(ROUND(1000+match(dataset.name,official.name)) AS string)
                    ,CAST(1000000 + official.tie_breaker as string)
  		  ,official.name
  		)
  	  ),5+7) END as closest_name,
  	  CASE MAX(match(dataset.name,official.name))
  	  WHEN 0 THEN NULL
  	  ELSE SUBSTR(MAX(
  		CONCAT(
  		  CAST(ROUND(1000+match(dataset.name,official.name)) AS string)
                    ,CAST(1000000 + official.tie_breaker as string)
  		  ,official.id
  		)
  	  ),5+7) END as corresponding_id
  	FROM (
  	  SELECT DISTINCT name FROM `table1`
  	) AS dataset 
  	CROSS JOIN (
  	  SELECT id, name, ROW_NUMBER() OVER () as tie_breaker
  	  FROM `table2`
  	  --WHERE optional filter
  	) as official
  	GROUP BY 1
  )
  SELECT main.name, main.similarity, main.closest_name, main.id_of_closest_name, 
  STRING_AGG( CASE WHEN dupe.similarity > main.similarity THEN dupe.name END) as id_is_better_matched_to,
  STRING_AGG( CASE WHEN dupe.similarity < main.similarity THEN dupe.name END) as id_is_also_closest_to,
  CASE WHEN STRING_AGG( CASE WHEN dupe.similarity > main.similarity THEN dupe.name END) IS NULL THEN main.id_of_closest_name END as suggested_id
  FROM closest_by_name as main
  LEFT JOIN closest_by_name as dupe
   ON  dupe.id_of_closest_name = main.id_of_closest_name
   AND dupe.name <> main.name
  GROUP BY 1,2,3,4
  ORDER BY 2 ASC