12

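I have a large database of banks in MongoDB. I can easily pull this information and build a Whoosh index from it. For example, I would like to be able to match the bank names "Eagle Bank & Trust Co of Missouri" and "Eagle Bank and Trust Company of Missouri". The following code works for simple fuzzy matches, but cannot match the names above: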

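import os

from whoosh.index import create_in
from whoosh.fields import *

schema = Schema(name=TEXT(stored=True))

# create_in() expects the index directory to exist already.
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = create_in("indexdir", schema)
writer = ix.writer()

test_items = [u"Eagle Bank and Trust Company of Missouri"]

# Index every test name (the loop was missing, so `item` was undefined).
for item in test_items:
    writer.add_document(name=item)
writer.commit()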

from whoosh.qparser import QueryParser
from whoosh.query import FuzzyTerm

with ix.searcher() as s:
    qp = QueryParser("name", schema=ix.schema, termclass=FuzzyTerm)
    q = qp.parse(u"Eagle Bank & Trust Co of Missouri")
    results = s.search(q)
    print(results)

This gives me:

<Top 0 Results for And([FuzzyTerm('name', u'eagle', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'bank', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'trust', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'co', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'missouri', boost=1.000000, minsimilarity=0.500000, prefixlength=1)]) runtime=0.00166392326355>

Is it possible to achieve what I want with Whoosh? If not, what other Python-based solutions do I have?

4 Answers

11

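You could match "Co" with "Company" using fuzzy search in Whoosh, but you shouldn't, because the difference between "Co" and "Company" is large. "Co" is similar to "Company" in the same way that "Be" is similar to "Beast" and "ny" is similar to "Company", so you can imagine how bad and how large the search results would be.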

However, if you want to match "Compan", "compani", or "Companee" to "Company", you can do so with a custom subclass of FuzzyTerm whose default maxdist is 2 or more:

maxdist – The maximum edit distance from the given text.

from whoosh.query import FuzzyTerm

class MyFuzzyTerm(FuzzyTerm):
    # Same as FuzzyTerm, but the default maxdist is raised to 2.
    def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
        super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)

Then:

 qp = QueryParser("name", schema=ix.schema, termclass=MyFuzzyTerm)

You could match "Co" with "Company" by setting maxdist to 5, but as I said, that gives bad search results. I suggest keeping maxdist between 1 and 3.
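To make the numbers above concrete, here is a minimal sketch that computes plain Levenshtein edit distances for the words mentioned. The helper name edit_distance is an assumption made for this illustration, and Whoosh's own matching may count edits slightly differently (it also applies prefixlength), so treat the values as a guide rather than as what Whoosh computes internally.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


# Hypothetical sample words taken from the discussion above.
for candidate in ["compan", "compani", "companee", "co"]:
    print(candidate, edit_distance(candidate, "company"))
```

The near-misses sit within one or two edits of "company", while "co" needs five, which is why a maxdist large enough to cover it would also admit many unrelated terms.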

If you are looking to match linguistic variations of a word, you are better off using whoosh.query.Variations.

Note: older Whoosh versions have minsimilarity instead of maxdist.

answered 2015-05-28T09:34:53.943
3

For future reference: there must be a better way to do this, but here is my shot.

# -*- coding: utf-8 -*-
import whoosh
from whoosh.index import create_in
from whoosh.fields import *
from whoosh.query import *
from whoosh.qparser import QueryParser

schema = Schema(name=TEXT(stored=True))
idx = create_in("C:\\idx_name\\", schema, "idx_name")

writer = idx.writer()

writer.add_document(name=u"This is craaazy shit")
writer.add_document(name=u"This is craaazy beer")
writer.add_document(name=u"Raphaël rocks")
writer.add_document(name=u"Rockies are mountains")

writer.commit()

s = idx.searcher()
print("Fields: %s" % list(s.lexicon("name")))
qp = QueryParser("name", schema=schema, termclass=FuzzyTerm)

# Widen the allowed edit distance step by step until something matches.
for i in range(1, 40):
    res = s.search(FuzzyTerm("name", "just rocks", maxdist=i, prefixlength=0))
    if len(res) > 0:
        for r in res:
            print("Potential match ( %s ): [  %s  ]" % (i, r["name"]))
        break
    else:
        print("Pass: %s" % i)

s.close()
answered 2011-10-20T13:23:15.710
1

Perhaps some of this might help (string matching, open sourced by the SeatGeek folks):

https://github.com/seatgeek/fuzzywuzzy
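
As a rough illustration of the kind of string-similarity score such a library produces, here is a minimal sketch that uses only the standard library's difflib.SequenceMatcher. It is not the fuzzywuzzy API; the helper name similarity, the lowercasing step, and the second bank name are assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


query = "Eagle Bank & Trust Co of Missouri"
score = similarity(query, "Eagle Bank and Trust Company of Missouri")
# Hypothetical unrelated name, used only for comparison.
unrelated = similarity(query, "First National Bank of Omaha")

print(round(score, 2), round(unrelated, 2))
```

The two spellings of the Eagle Bank name score much closer to 1 than the unrelated name does, but any cut-off for deciding what counts as a match would need tuning against real data.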

answered 2011-07-17T08:30:08.937
-3

You could use the function below to fuzzy search a set of words against a phrase:

def FuzzySearch(text, phrase):
    """Check if word in phrase is contained in text"""
    for word in phrase.split(" "):
        if word in text:
            print("Match! Found " + word + " in text")
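
Note that the function above only reports exact substring hits, so a misspelled word is not found. A minimal sketch of a tolerant variant, assuming the standard library's difflib.get_close_matches is acceptable, could look like this; the name fuzzy_word_matches and the 0.75 cutoff are hypothetical choices for illustration, not part of the original answer.

```python
from difflib import get_close_matches


def fuzzy_word_matches(text, phrase, cutoff=0.75):
    """Pair each word of phrase with its closest word in text, if similar enough."""
    text_words = text.lower().split()
    matches = []
    for word in phrase.lower().split():
        # n=1 keeps only the single best candidate at or above the cutoff.
        close = get_close_matches(word, text_words, n=1, cutoff=cutoff)
        if close:
            matches.append((word, close[0]))
    return matches


found = fuzzy_word_matches("Eagle Bank and Trust Company of Missouri",
                           "Eagel Bank & Trust Co of Missouri")
print(found)
```

With this cutoff the misspelled "Eagel" is paired with "eagle", while "Co" and "&" are left unmatched, which mirrors the point made earlier that abbreviations are too far from the full word for fuzzy matching alone.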
answered 2016-02-29T17:55:45.700