python - Help with the speed of difflib

Question

I am using difflib (SequenceMatcher) for this task: for 3000 book titles with typos, find the closest match in a database of 128500 (supposedly) error-free titles. The code is simple:

from difflib import SequenceMatcher

# imp and dbf are lists of ordered dicts
# imp contains the messy titles to match (in imp['Titel:'] in each dict)
# dbf contains the clean titles to match against (in dbf['ti']) 

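threshold = 0.65
match = []      # [score, messy record, best database record] for each hit
nonmatch = []   # messy records with no match above the threshold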

for imprec in imp:
    bestmatch = threshold
    mpair = []

    try: 
        i = imprec['Titel:'][0]
    except KeyError: 
        print('record with InvNo %s has no title' % imprec['InvNo:'])
        continue

    for rec in dbf: 
        try: 
            r = rec['ti'][0]
        except KeyError:
            # record has no title. Do not make screen output.. 
            # print('record with priref %s has no title' % rec['%0'])
            continue

        # Compute the ratio once and reuse it below.
        score = SequenceMatcher(None, i, r).ratio()
        if score > bestmatch:
            bestmatch = score
            mpair = [bestmatch, imprec, rec]
            print('{} matches for {:4.1f}% with {}'.format(i, score*100, r))

    if bestmatch > threshold:
        match.append(mpair)
    else:
        nonmatch.append(imprec)

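It works, but it is slow. After about 100 hours, it has worked through roughly 40% of the 3000-title list. The problem, of course, is that 3000 iterations over 128500 titles = 385.5 million calls to SequenceMatcher.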
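One way to cut the cost of each of those calls, independent of any indexing, is sketched below. It relies on two documented features of SequenceMatcher: it caches its analysis of the second sequence, so the messy title can be set once with set_seq2() while the database titles are swapped in with set_seq1(); and real_quick_ratio() and quick_ratio() are cheap upper bounds on ratio(), so most candidates can be rejected without the full comparison. The function name best_match and its arguments are hypothetical, and note that the argument order is reversed compared with the loop above, which can change ratio() slightly in some cases.

```python
from difflib import SequenceMatcher

def best_match(query, candidates, threshold=0.65):
    """Return (score, candidate) for the best candidate above threshold, else None."""
    matcher = SequenceMatcher()
    matcher.set_seq2(query)          # analysed once, reused for every candidate
    best_score, best = threshold, None
    for cand in candidates:
        matcher.set_seq1(cand)
        # Both calls are upper bounds on ratio(); if they cannot beat the
        # current best, the expensive ratio() call is skipped.
        if matcher.real_quick_ratio() <= best_score:
            continue
        if matcher.quick_ratio() <= best_score:
            continue
        score = matcher.ratio()
        if score > best_score:
            best_score, best = score, cand
    return (best_score, best) if best is not None else None
```

Called as best_match(i, titles) with titles being a plain list of the database title strings, this would replace the inner loop. How much it saves depends on how quickly a good match raises the bar for the remaining candidates.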

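I am looking for ways to optimize this. In this post, the OP built an index, queried it, and then ran SequenceMatcher on the results of that query. I think that is a good approach, but how would I implement it?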
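A minimal sketch of that index-then-match idea, using only the standard library, might look like the following. It assumes a character-trigram inverted index as the index (the referenced post may have used something else), and all function names and the min_shared cutoff are hypothetical. Only titles that share a few trigrams with the messy title are passed to SequenceMatcher.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def trigrams(text):
    """Set of character 3-grams of a lower-cased, space-padded title."""
    padded = "  " + text.lower() + " "
    return {padded[k:k + 3] for k in range(len(padded) - 2)}

def build_index(titles):
    """Map each trigram to the positions of the titles that contain it."""
    index = defaultdict(set)
    for pos, title in enumerate(titles):
        for gram in trigrams(title):
            index[gram].add(pos)
    return index

def candidates(query, index, min_shared=3):
    """Positions of titles sharing at least min_shared trigrams with query."""
    counts = defaultdict(int)
    for gram in trigrams(query):
        for pos in index.get(gram, ()):
            counts[pos] += 1
    return [pos for pos, n in counts.items() if n >= min_shared]

def closest(query, titles, index, threshold=0.65):
    """Best (score, title) among the candidates; title is None if none qualifies."""
    best_score, best = threshold, None
    for pos in candidates(query, index):
        score = SequenceMatcher(None, query, titles[pos]).ratio()
        if score > best_score:
            best_score, best = score, titles[pos]
    return best_score, best
```

The index is built once over the clean titles and then queried once per messy title. The trade-off is recall: a title mangled badly enough to share fewer than min_shared trigrams with its true match is never compared, so the cutoff is worth checking on a sample.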

The script is a one-off, not a fancy application or anything. My budget of hours is limited.

Edit: Whoosh supports fuzzy term queries. SQLite has LIKE.
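For the SQLite route, a rough sketch of using LIKE as a prefilter is shown below. LIKE is a substring pattern match rather than a fuzzy match, so the assumption here is that at least one longer word of the messy title is spelled correctly. The table name titles, the column ti, the helper like_candidates and the min_len cutoff are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (ti TEXT)")
conn.executemany(
    "INSERT INTO titles VALUES (?)",
    [("Moby Dick",), ("War and Peace",), ("Dracula",)],
)

def like_candidates(conn, messy_title, min_len=4):
    """Titles containing at least one word of messy_title of min_len or more characters."""
    found = set()
    for word in messy_title.split():
        if len(word) < min_len:
            continue
        # Note: '%' or '_' inside a word would act as LIKE wildcards.
        rows = conn.execute(
            "SELECT ti FROM titles WHERE ti LIKE ?", ("%" + word + "%",)
        )
        found.update(row[0] for row in rows)
    return found
```

The returned set would then be scored with SequenceMatcher as before. A title with a typo in every long word yields no candidates this way, which is the main limitation compared with a true fuzzy query.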

Are there other possibilities I should look into?
