I have a company database with 800K company names. It started out at around 1.7 million, but I removed the exact duplicates with df.drop_duplicates().
Now the situation is that there are company names that are effectively the same but differ by one or two missing or different letters, punctuation, and so on (found via difflib's get_close_matches).
I've tried difflib's get_close_matches (which is built on SequenceMatcher) and it worked like a charm. The code is below:
    import difflib

    company_list = df['Recipient'].tolist()

    # find everything within 0.90 similarity of the first name
    close_matches = difflib.get_close_matches(company_list[0], company_list, n=100, cutoff=0.90)

    # map all of those variants onto the first match, then drop the duplicates
    df.replace(close_matches, close_matches[0], inplace=True)
    df.drop_duplicates('Company', inplace=True)
What I am doing:

1. Getting the close matches for the next item in the list
2. Replacing all close matches with the first match in the data frame
3. Dropping duplicates in the data frame
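Roughly, the full loop looks like the sketch below (the exact bookkeeping may differ from my real code; the column name 'Recipient' is the same as above):

    import difflib

    def collapse_close_matches(names, cutoff=0.90):
        # Map every name to the first member of its close-match group.
        # Each get_close_matches call runs SequenceMatcher against the whole
        # list, so this is roughly O(n^2) string comparisons overall.
        unique_names = list(dict.fromkeys(names))
        canonical = {}
        for name in unique_names:
            if name in canonical:
                continue  # already assigned to an earlier group
            matches = difflib.get_close_matches(name, unique_names, n=100, cutoff=cutoff)
            for match in matches:
                canonical.setdefault(match, name)
        return canonical

    # usage:
    # mapping = collapse_close_matches(df['Recipient'])
    # df['Recipient'] = df['Recipient'].map(mapping)
    # df = df.drop_duplicates('Recipient')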
The problem starts here. If I run this in a while loop, the first 1000 items alone take around 3 hours, and the data has around 800K values. I thought BigQuery might be much faster than Python, and I've checked existing questions about it, but I could not find a suitable query, and the jaccard() and levenshtein() UDFs only compare 2 strings at a time.
Long story short, I have a string from the list that I want to search against a whole column automatically, with a loop or something similar, in BigQuery.
It would be nice if there is a solution to this. I'd really appreciate any help.
In sklearn there is TfidfVectorizer; it can be configured to use character n-grams. fit() builds the vocabulary and transform() returns a sparse matrix of the trigram vectors, already normalized to unit length. When you get a search token, you transform() it and take the dot product with the sparse matrix; because the rows are unit vectors, this gives you the cosine similarity, so just sort the values in descending order.
The good thing: very frequent trigrams in your database are down-weighted by TF-IDF, so you get trigram matching with more weight on the less probable co-occurrences 😉
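Something like this minimal sketch (the column name 'Recipient' and the top-10 cutoff are just assumptions to make it concrete):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    names = df['Recipient'].drop_duplicates().tolist()

    # character trigrams; 'char_wb' builds n-grams only inside word boundaries
    vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
    matrix = vectorizer.fit_transform(names)  # sparse, rows already L2-normalized

    def search(query, top_n=10):
        # dot product with unit-norm rows == cosine similarity
        query_vec = vectorizer.transform([query])
        scores = (matrix @ query_vec.T).toarray().ravel()
        best = np.argsort(scores)[::-1][:top_n]
        return [(names[i], scores[i]) for i in best]

    # e.g. search("Example Company Ltd") -> list of (name, similarity) pairs

Building the matrix is a one-off cost; each lookup afterwards is a single sparse matrix-vector product, which scales far better than pairwise SequenceMatcher calls.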