python - How to do fuzzy string search without a heavy database?

Question

I have a mapping of catalog numbers to product names:

35  cozy comforter
35  warm blanket
67  pillow

and need a search that would find misspelled, mixed names like "warm cmfrter".

We have code using edit-distance (difflib), but it probably won't scale to the 18000 names.

I achieved something similar with Lucene, but as PyLucene only wraps Java that would complicate deployment to end-users.

SQLite doesn't usually have full-text or scoring compiled in.

The Xapian bindings are like C++ and have some learning curve.

Whoosh is not yet well-documented but includes an abusable spell-checker.

What else is there?

score 4 · Accepted Answer

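Apparently the only way to make fuzzy comparisons fast is to do fewer of them ;)

Instead of writing another n-gram search or improving the one in Whoosh, we now keep a word index, retrieve all entries that share at least one (correctly spelled) word with the query, and use difflib to rank them. That works well enough in this case.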
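A minimal sketch of that idea, assuming the catalog fits in memory as a list of (number, name) pairs. The names `CATALOG`, `build_word_index` and `search` are made up for illustration and are not from the original code:

```python
import difflib
from collections import defaultdict

# Hypothetical catalog, copied from the example in the question.
CATALOG = [
    (35, "cozy comforter"),
    (35, "warm blanket"),
    (67, "pillow"),
]

def build_word_index(catalog):
    """Map each lowercased word to the set of catalog rows containing it."""
    index = defaultdict(set)
    for row, (_number, name) in enumerate(catalog):
        for word in name.lower().split():
            index[word].add(row)
    return index

def search(query, catalog, index, limit=5):
    """Rank only the entries that share at least one exact word with the query."""
    query = query.lower()
    candidates = set()
    for word in query.split():
        candidates |= index.get(word, set())
    scored = []
    for row in candidates:
        name = catalog[row][1].lower()
        ratio = difflib.SequenceMatcher(None, query, name).ratio()
        scored.append((ratio, catalog[row]))
    # Highest similarity first; sort on the ratio only.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _ratio, entry in scored[:limit]]

INDEX = build_word_index(CATALOG)
print(search("warm cmfrter", CATALOG, INDEX))
```

The expensive difflib comparison only runs on the candidates the index returns. The trade-off is that a query in which every word is misspelled finds nothing.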

score 3 · Accepted Answer

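A SOUNDEX implementation will give you too many false positives. There are only 26,000 (at most) possible SOUNDEX codes.

Although the Metaphone algorithm was designed for English surnames, it works very well for misspellings; I used it once in a branch locator and it was quite successful.

Add a field with the Metaphone translation and match against it if no exact match is found. You will still get false positives, but fewer than with other algorithms.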

score 2 · Accepted Answer

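Nucular has full-text search, but it does not support misspelled matches out of the box. You could try adding an extra field to each entry that indexes the SOUNDEX translation of its terms, and then search using the soundex translation of the user input. I really don't know how well this would work...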
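Here is a rough sketch of the extra-field idea, using a plain dictionary instead of Nucular. The `soundex` function follows the common American Soundex rules and is written here for illustration. Other implementations differ in details, so their codes may not match these. `CATALOG`, `build_soundex_index` and `lookup` are assumed names:

```python
import string
from collections import defaultdict

# Soundex digit groups; vowels, h, w and y carry no digit.
_CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for letter in letters:
        _CODES[letter] = str(digit)

def soundex(word):
    """Return a 4-character Soundex code, or '' if the word has no ASCII letters."""
    letters = [c for c in word.lower() if c in string.ascii_lowercase]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate consonants with the same digit
        code = _CODES.get(c, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated digit is kept
    return (result + "000")[:4]

# Hypothetical catalog, copied from the example in the question.
CATALOG = [(35, "cozy comforter"), (35, "warm blanket"), (67, "pillow")]

def build_soundex_index(catalog):
    """Index every entry under the Soundex code of each of its words."""
    index = defaultdict(set)
    for number, name in catalog:
        for word in name.split():
            code = soundex(word)
            if code:
                index[code].add((number, name))
    return index

def lookup(query, index):
    """Return entries sharing at least one Soundex code with the query."""
    hits = set()
    for word in query.split():
        hits |= index.get(soundex(word), set())
    return hits

SOUNDEX_INDEX = build_soundex_index(CATALOG)
print(sorted(lookup("warm cmfrter", SOUNDEX_INDEX)))
```

With these rules, "cmfrter" and "comforter" both encode to C516, so the misspelled query from the question reaches both the comforter and the blanket entries.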

Have a look at advas (http://advas.sourceforge.net/news.php), which has a nice demo comparing various soundex-like methods:

advas/examples Aaron$ python phonetic_algorithms.py 
                    soundex       metaphone           nyiis      caverphone 
====================================================================================================
 schmidt :             S253           sxmtt          sssnad      SKMT111111
  schmid :             S253            sxmt          sssnad      SKMT111111
 schmitt :             S253            sxmt         sssnatt      SKMT111111
   smith :             S530            sm0h           snatt      SMT1111111
  smythe :             S530           smy0h           snatt      SMT1111111
 schmied :             S253            sxmt         sssnaad      SKMT111111
   mayer :             M600             myr           naaar      MA11111111
   meier :             M600              mr           naaar      MA11111111
....

I don't know whether any of them is any good for your unnamed language...

score 1 · Accepted Answer

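The usual way of computing the edit distance between two strings is a rather expensive algorithm (if I remember correctly, its time complexity is quadratic). Maybe your problem would go away if you used a different string similarity metric.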

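One of my favorite methods for fuzzy string matching is trigram matching. Comparing two strings with this method has linear time complexity, which is much better than the edit distance mentioned above. You can find my Python implementation on Github. There is also a PostgreSQL contrib module for exactly this purpose. Adapting it to SQLite3 shouldn't be too hard.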
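To give a rough idea of how trigram matching works, here is a small self-contained sketch. It is not the Github implementation mentioned above. The padding scheme and the use of Jaccard similarity are assumptions made for this example:

```python
def trigrams(text):
    """Return the set of 3-character substrings of the padded, lowercased text."""
    # Padding (an assumption of this sketch) lets the first and last
    # characters contribute trigrams of their own.
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Jaccard similarity of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def best_match(query, names):
    """Return the name whose trigram set is most similar to the query's."""
    return max(names, key=lambda name: similarity(query, name))

print(similarity("cmfrter", "comforter"))
print(similarity("cmfrter", "pillow"))
print(best_match("cmfrter", ["comforter", "blanket", "pillow"]))
```

Building each trigram set takes time proportional to the length of the string, and the sets for the catalog names can be computed once and stored, so a query only pays for its own trigrams plus the set intersections.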

score 1 · Accepted Answer

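Sybase SQL Anywhere has a free Web edition/Developer edition, and it comes with full-text indexing/search and a FUZZY operator (along with some implementation constraints).

To quote the documentation: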

Specifying 'FUZZY "500 main street"' is equivalent to 
'500 OR mai OR ain OR str OR tre OR ree OR eet'.

Another approach would be to use scoring on the full-text search.

score 1 · Accepted Answer

sqlite3 supports Python callback functions. Matthew Barnett's regex module (http://code.google.com/p/mrab-regex-hg/) now supports approximate matching.

So, something like this:

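import sqlite3
import sys

try:
    import regex
except ImportError:
    sys.stderr.write("Can't import mrab-regex; see http://pypi.python.org/pypi/regex\n")
    sys.exit(1)

def _sqlite3_regex(expr, item):
    # SQLite evaluates "item REGEXP expr" as regexp(expr, item).
    return item is not None and regex.search(expr, item) is not None

def main():
    ...
    database = sqlite3.connect(dbfile)
    database.create_function("regexp", 2, _sqlite3_regex)
    # Fuzzy pattern: match queriedname with at most `distance` errors.
    pattern = '(?:%s){e<=%d}' % (queriedname, distance)
    # Bind the pattern as a parameter instead of formatting it into the SQL.
    print(database.cursor().execute(
        "SELECT * FROM products WHERE productname REGEXP ?", (pattern,)).fetchall())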
