Friday, May 27, 2016

Comparison and Review of .NET Fuzzy Matching NuGet Packages

I am simply using Jaro-Winkler to get a similarity factor for two strings.  I'm using this for name and address comparisons, and I do my own score aggregation and weighting on top of the per-field results.
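To make "score aggregation and weighting" concrete, here is a minimal sketch of one way to combine per-field similarities into a single score.  It is in Python only to keep it short.  The field names, the weights, and the weighted_score helper are all hypothetical, and the default scorer is difflib's SequenceMatcher ratio as a stand-in, not Jaro-Winkler; any function that returns a value between 0 and 1 can be passed in instead.

```python
from difflib import SequenceMatcher


def stand_in_similarity(a: str, b: str) -> float:
    """Placeholder scorer in [0, 1]. This is NOT Jaro-Winkler; swap in a real one."""
    return SequenceMatcher(None, a, b).ratio()


def weighted_score(record_a, record_b, weights, similarity=stand_in_similarity):
    """Weighted average of per-field similarities.

    record_a / record_b: dicts of field name -> string (hypothetical shape).
    weights: dict of field name -> non-negative weight.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    total = 0.0
    for field, weight in weights.items():
        # Missing fields are compared as empty strings.
        total += weight * similarity(record_a.get(field, ""), record_b.get(field, ""))
    return total / total_weight


if __name__ == "__main__":
    a = {"name": "John Smith", "address": "12 Canyon Rd"}
    b = {"name": "Jon Smith", "address": "12 Canyon Rd"}
    # Illustrative weights only: address counts three times as much as name.
    print(weighted_score(a, b, {"name": 1, "address": 3}))
```

Dividing by the total weight keeps the result between 0 and 1 no matter how the weights are scaled, which makes it easier to pick a single match threshold later.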

I first tried FuzzyString.  Unfortunately, it has several issues preventing it from working properly.  Beyond those issues, I found other inputs that sent its Jaro-Winkler algorithm into an infinite loop.  It's funny that this package has a 5 star rating, because for my use case, which only needs Jaro-Winkler, it failed miserably.

Then I tried BlueSimilarity.  This package had issues of its own: it had trouble loading BlueSimilarity.Interop.dll.  At this point I was tired of troubleshooting and just wanted a solution that worked.  Besides, the project site link on NuGet is broken.  Man.

Finally I tried SimMetrics-TextFunctions.  This worked really well!  I had a few small unit tests to verify that the bugs in FuzzyString are not in this implementation.  Awesome!

EDIT: Wow.  I found out 7 months later that this package does indeed have a bug.  It is easy to work around, but I consider it a bug nonetheless.  When one of the strings has a leading space, the comparison returns a similarity of zero.

EDIT 2: Over a year later, I have found another bug.  With the strings "Canyon Rd" and "Canyon Est Dr" I am getting a similarity score of 0.4.  It should be much higher than that.  So...  I'm changing my implementation... again.

Now I am using some code from Stack Overflow.  I'm really surprised that all the NuGet packages have some kind of issue and that this bit of code passes where all the others fail.  Thank goodness for unit tests.  I have a ton of them.
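For reference, here is a minimal sketch of a textbook Jaro-Winkler together with regression checks for the two inputs from the edits above.  It is written in Python to keep it short, and it is not the Stack Overflow code or the code from any of the packages.  The function names and the thresholds in the checks are my own assumptions; the prefix scale of 0.1 and the prefix cap of 4 are the usual Winkler defaults.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; case-sensitive."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0

    # Characters only count as matching if they are within this distance.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Walk both strings' matched characters in order and count mismatches.
    k = 0
    out_of_order = 0
    for i, ch in enumerate(s1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if ch != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len(s1)
            + matches / len(s2)
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of up to max_prefix characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1.0 - j)


def regression_checks() -> None:
    """The two inputs from the edits above; thresholds are illustrative."""
    # A leading space on one string must not collapse the score to zero.
    assert jaro_winkler(" Canyon Rd", "Canyon Rd") > 0.9
    # These two addresses share a long prefix and should score well above 0.4.
    assert jaro_winkler("Canyon Rd", "Canyon Est Dr") > 0.8


if __name__ == "__main__":
    regression_checks()
    print("ok")
```

Note that the comparison is case-sensitive, so "Rd" and "Dr" do not match each other here.  Trimming and normalizing case before comparing is a separate decision and is left out of the sketch on purpose.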
