Hello HN.
I built Flookup Data Wrangler, a powerful Google Sheets add-on for cleaning data without writing a single line of code.
Traditional Soundex is designed for single words: "John" and "Jonny" both encode to J500, so comparing such strings is straightforward. However, standard Soundex output cannot handle multi-word or reordered comparisons like "John Doe" vs "Doe Jonny", because the code is anchored to the first letter of the string and truncated to four characters, which produces inaccurate results.
To address this, I modified the Soundex algorithm to support multi-word and reordered strings by adding a helper function that re-encodes the output into a format suitable for accurate text-to-text comparisons. The helper adds negligible overhead.
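To make the idea concrete, here is a minimal Python sketch of one way to get an order-insensitive, multi-word comparison out of Soundex. It is not Flookup's actual implementation, which is not shown in this post: the names `soundex` and `phrase_key` are hypothetical, and encoding each word separately and then sorting the codes is only an assumption about how such a helper might work.

```python
import re

# Standard American Soundex digit for each consonant.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Classic four-character Soundex code for a single word."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    encoded = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # vowels reset prev, so a repeated code counts again
    return (encoded + "000")[:4]


def phrase_key(text: str) -> str:
    """Hypothetical helper: encode every word, then sort the codes so that
    word order no longer affects the comparison."""
    codes = [soundex(token) for token in re.split(r"[^A-Za-z]+", text)]
    return " ".join(sorted(code for code in codes if code))


print(phrase_key("John Doe"))   # D000 J500
print(phrase_key("Doe Jonny"))  # D000 J500
```

Two strings can then be treated as a phonetic match when their keys are equal. A real matcher would likely also need to tolerate extra or missing words, which this sketch does not attempt.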
With this enhancement, Flookup users can do the following:
+ Fuzzy matching and merging
+ Duplicate highlighting and removal
+ Extracting a list of unique values
... all based on the sound the strings or parts of the strings make (as pronounced in English).
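As a rough illustration of how duplicate detection and unique-value extraction can both fall out of a single comparison key, here is a small Python sketch. Every name in it is hypothetical and none of it is taken from Flookup. To keep the block self-contained, `word_order_key` is a stand-in that only ignores case and word order; a sound-based key function would be passed in the same way.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def word_order_key(text: str) -> str:
    """Stand-in key (assumption): lowercase the words and sort them."""
    return " ".join(sorted(text.lower().split()))


def group_by_key(values: Iterable[str],
                 key: Callable[[str], str]) -> Dict[str, List[str]]:
    """Bucket values that share the same comparison key, in first-seen order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for value in values:
        groups[key(value)].append(value)
    return dict(groups)


def duplicate_groups(values: Iterable[str],
                     key: Callable[[str], str]) -> List[List[str]]:
    """Groups with more than one member are candidate duplicates."""
    return [g for g in group_by_key(values, key).values() if len(g) > 1]


def unique_values(values: Iterable[str],
                  key: Callable[[str], str]) -> List[str]:
    """Keep the first value seen for each key."""
    return [g[0] for g in group_by_key(values, key).values()]


names = ["John Doe", "doe john", "Jane Roe", "John Doe"]
print(duplicate_groups(names, word_order_key))
print(unique_values(names, word_order_key))
```

Highlighting would use the groups returned by `duplicate_groups`, while removal and unique-list extraction keep one representative per group.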
I would love feedback, especially from those into data cleaning (which I'm guessing is everyone).
If you would like to give it a try, here is a quick start guide: https://www.getflookup.com/get-started