0

I am working with a dataframe which has 2 columns of city names which should be equal. But they are not due to administrative errors, spelling mistakes or name changes. I am trying to see when those city names are 'equal enough' to be assumed equal. Using SequenceMatcher I can divide the list in roughly 3 parts: Everything is wrong, some are wrong, some are right, everything is right.

In a perfect world I would want the list to be divided in: Everything is wrong, Everything is right. Where the split can be made around a certain ratio/matching value.

Therefore, SequenceMatcher does not do the trick for me. I found textdistance but I get overwhelmed by the possibilities and probably there are more possibilities. An example where it goes wrong:

'Zeddam' and 'Didam' are classified with a ratio of 0.72 (Which are not equal). 'Nes' and 'Nes gem dongeradeel' with a ratio of 0.45 (Which are equal, the second just specifies it's province)

Just checking if one of the strings is a subset of the other string does not do the trick since it will give problems in other cases.

Do you guys have a suggestion on what string comparing algorithm is appropriate and why? I am comparing multiple columns in my dataframe, which is around 1000 rows.

4

0 回答 0