I have a company database with 800K company names. It started out at around 1.7 million, but I removed the exact duplicates with df.drop_duplicates().
Now the situation is that there are company names that are effectively the same but differ by one or two missing or different letters, punctuation, and so on (found via difflib's get_close_matches).
I've tried difflib's get_close_matches (which is built on SequenceMatcher) and it worked like a charm. The code is below:
    import difflib

    company_list = df['Recipient'].tolist()

    # find everything within 0.90 similarity of the first name
    close_matches = difflib.get_close_matches(company_list[0], company_list, n=100, cutoff=0.90)

    # map all of those variants onto the first match, then drop the duplicates
    df.replace(close_matches, close_matches[0], inplace=True)
    df.drop_duplicates('Company', inplace=True)
What I am doing:

1. Getting the close matches for the next item in the list
2. Replacing all close matches with the first match in the data frame
3. Dropping duplicates in the data frame
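Roughly, the full loop looks like the sketch below (the exact bookkeeping may differ from my real code; the column name 'Recipient' is the same as above):

    import difflib

    def collapse_close_matches(names, cutoff=0.90):
        # Map every name to the first member of its close-match group.
        # Each get_close_matches call runs SequenceMatcher against the whole
        # list, so this is roughly O(n^2) string comparisons overall.
        unique_names = list(dict.fromkeys(names))
        canonical = {}
        for name in unique_names:
            if name in canonical:
                continue  # already assigned to an earlier group
            matches = difflib.get_close_matches(name, unique_names, n=100, cutoff=cutoff)
            for match in matches:
                canonical.setdefault(match, name)
        return canonical

    # usage:
    # mapping = collapse_close_matches(df['Recipient'])
    # df['Recipient'] = df['Recipient'].map(mapping)
    # df = df.drop_duplicates('Recipient')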
The problem starts here. If I run this in a while loop, the first 1000 items alone take around 3 hours, and the data has around 800K values. I thought BigQuery might be much faster than Python, and I've checked existing questions about it, but I could not find a suitable query, and the jaccard() and levenshtein() UDFs only compare 2 strings at a time.
Long story short, I have a string from the list that I want to search against a whole column automatically, with a loop or something similar, in BigQuery.
It would be nice if there is a solution to this. I'd really appreciate any help.
In sklearn there is TfidfVectorizer; it can be configured to use character n-grams. fit() builds the vocabulary and transform() returns a sparse matrix of the trigram vectors, already normalized to unit length. When you get a search token, you transform() it and take the dot product with the sparse matrix; because the rows are unit vectors, this gives you the cosine similarity, so just sort the values in descending order.
The good thing: very frequent trigrams in your database are down-weighted by TF-IDF, so you get trigram matching with more weight on the less probable co-occurrences 😉
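Something like this minimal sketch (the column name 'Recipient' and the top-10 cutoff are just assumptions to make it concrete):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    names = df['Recipient'].drop_duplicates().tolist()

    # character trigrams; 'char_wb' builds n-grams only inside word boundaries
    vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
    matrix = vectorizer.fit_transform(names)  # sparse, rows already L2-normalized

    def search(query, top_n=10):
        # dot product with unit-norm rows == cosine similarity
        query_vec = vectorizer.transform([query])
        scores = (matrix @ query_vec.T).toarray().ravel()
        best = np.argsort(scores)[::-1][:top_n]
        return [(names[i], scores[i]) for i in best]

    # e.g. search("Example Company Ltd") -> list of (name, similarity) pairs

Building the matrix is a one-off cost; each lookup afterwards is a single sparse matrix-vector product, which scales far better than pairwise SequenceMatcher calls.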