Yes, it is OK to use Levenshtein distance instead of the corpus of misspellings. Unless you are Google, you will not get access to a large and reliable enough corpus of misspellings. There are many other metrics that will do the job. I have used Levenshtein distance weighted by the distance between the differing letters on a keyboard. The idea is that `abc` is closer to `abx` than to `abp`, because `p` is farther away from `x` on my keyboard than `c` is.
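A minimal sketch of this idea, assuming a plain QWERTY grid where substitution cost scales with physical key distance. The coordinate table and the normalization constant are illustrative choices, not the only way to weight it:

```python
# Keyboard layout as a grid; each key gets (row, column) coordinates.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {c: (r, col) for r, row in enumerate(ROWS) for col, c in enumerate(row)}

def key_dist(a, b):
    """Euclidean distance between two keys on the grid."""
    (r1, c1), (r2, c2) = POS[a], POS[b]
    return ((r1 - r2) ** 2 + (c1 - c2) ** 2) ** 0.5

def weighted_levenshtein(s, t):
    """Levenshtein distance where substituting nearby keys is cheap.

    Insert/delete cost 1; substitution cost is the key distance scaled
    into [0, 1] (dividing by 9, the grid's widest span, is arbitrary).
    """
    d = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = float(i)
    for j in range(len(t) + 1):
        d[0][j] = float(j)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            sub = 0.0 if s[i-1] == t[j-1] else min(
                1.0, key_dist(s[i-1], t[j-1]) / 9)
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + sub)
    return d[len(s)][len(t)]
```

With this weighting, `weighted_levenshtein("abc", "abx")` comes out smaller than `weighted_levenshtein("abc", "abp")`, matching the intuition above.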
Another option involves accounting for swapped characters: `swap` is a more likely correction of `sawp` than `saw` is, because this is how people type. They often swap the order of characters, but it takes some real talent to type `saw` and then randomly insert a `p` at the end.
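This swap rule is exactly what Damerau-Levenshtein distance captures: an adjacent transposition counts as a single edit rather than two substitutions. A sketch of the restricted variant:

```python
def damerau_levenshtein(s, t):
    """Edit distance with insert, delete, substitute, and adjacent swap,
    each costing 1 (the "restricted" Damerau-Levenshtein variant)."""
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i
    for j in range(len(t) + 1):
        d[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i-1] == t[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,        # delete
                          d[i][j-1] + 1,        # insert
                          d[i-1][j-1] + cost)   # substitute
            # Adjacent transposition: "wa" <-> "aw" is one edit.
            if i > 1 and j > 1 and s[i-1] == t[j-2] and s[i-2] == t[j-1]:
                d[i][j] = min(d[i][j], d[i-2][j-2] + 1)
    return d[len(s)][len(t)]
```

Under this metric `sawp` is one edit from `swap`, while `saw` is two edits away, so the swap-friendly correction wins.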
The rules above are called an *error model*: you are trying to leverage knowledge of how real-world spelling mistakes occur to help with your decision. You can come up with really complex rules (and people have). Whether they make a difference is an empirical question; you need to try and see. Chances are some rules will work better for some kinds of misspellings and worse for others. Google "how does aspell work" for more examples.
PS All of the example mistakes above are purely due to the use of a keyboard. Sometimes, people simply do not know how to spell a word, which is a whole other can of worms. Google "soundex".
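Soundex maps a word to a short phonetic code so that words that sound alike collide on the same code. A simplified sketch (the classic algorithm has a few extra conventions, e.g. around the initial letter, that this version only approximates):

```python
# Consonants grouped by similar sound; vowels and h/w/y get no code.
CODES = {c: str(d) for d, letters in enumerate(
    ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for c in letters}

def soundex(word):
    """Return a 4-character code: first letter + up to 3 digits."""
    word = word.lower()
    digits = []
    prev = CODES.get(word[0])
    for c in word[1:]:
        code = CODES.get(c)
        if code is not None and code != prev:
            digits.append(code)
        if c not in "hw":  # h and w do not separate equal codes
            prev = code
    return (word[0].upper() + "".join(digits) + "000")[:4]
```

For example, `soundex("Smith")` and `soundex("Smyth")` both yield `"S530"`, so a misspelling driven by pronunciation rather than keyboard geometry still lands near the intended word.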