python - Approximate matching of company names

Question

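I have 12 million company names in my database and I want to match them against an offline list. I would like to know the best algorithm for doing this. I have tried Levenshtein distance, but it does not give the expected results. Could you suggest some algorithms for this? The problem is matching companies such as:

G corp. ---- this needs to be mapped to G corporation
water Inc ---- Water Incorporated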

score 2 · Accepted Answer

You can use fuzzyset: put all of your company names into a fuzzy set, then match a new term against it to get a match score. An example:

import fuzzyset

fz = fuzzyset.FuzzySet()
# Create a list of terms we would like to match against in a fuzzy way
for name in ["Diane Abbott", "Boris Johnson"]:
    fz.add(name)

# Now see if our sample terms fuzzy-match any of those specified terms.
# get() returns a list of (score, matched_term) tuples, or None if nothing matches.
sample_term = 'Boris Johnstone'
print(fz.get(sample_term), fz.get('Diana Abbot'), fz.get('Joanna Lumley'))

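Additionally, if you want to work with semantics rather than just the string (which works better in such cases), take a look at spaCy's similarity. An example from the spaCy docs: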

import spacy

nlp = spacy.load('en_core_web_md')  # make sure to use larger model!
tokens = nlp(u'dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

score 2 · Accepted Answer

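You should probably start by expanding known suffixes in both lists (the database and the offline list). This will take some manual work to work out the correct mappings, e.g. using regular expressions: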

\s+inc\.?$->Incorporated
\s+corp\.?$->Corporation

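You may also want to do other normalization, such as lowercasing everything, removing punctuation, and so on.

You can then use Levenshtein distance or another fuzzy matching algorithm.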
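A minimal sketch of that pipeline, using only the standard library. The suffix list is illustrative and would need to be extended by hand, the function names `normalize` and `similarity` are made up for this example, and `difflib.SequenceMatcher` stands in here for Levenshtein distance (it is a different similarity measure, chosen only because it ships with Python). Note that the replacement strings keep a leading space, since the patterns consume the whitespace before the suffix.

```python
import re
from difflib import SequenceMatcher

# Illustrative suffix map (assumption): a real one would be longer and hand-curated.
SUFFIXES = [
    (re.compile(r"\s+inc\.?$"), " incorporated"),
    (re.compile(r"\s+corp\.?$"), " corporation"),
]

def normalize(name):
    """Lowercase, expand known suffixes, drop punctuation, collapse whitespace."""
    name = name.lower().strip()
    for pattern, replacement in SUFFIXES:
        name = pattern.sub(replacement, name)
    name = re.sub(r"[^\w\s]", "", name)       # remove punctuation
    return re.sub(r"\s+", " ", name).strip()  # collapse repeated whitespace

def similarity(a, b):
    """Similarity ratio in [0, 1] between the normalized forms of two names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

if __name__ == "__main__":
    for left, right in [("G corp.", "G corporation"), ("water Inc", "Water Incorporated")]:
        print(left, "|", right, "|", similarity(left, right))
```

After normalization both pairs from the question become identical strings, so the fuzzy step only has to deal with genuine spelling differences.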

score 1 · Accepted Answer

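Interzoid's Company Name Match Advanced API generates similarity keys to help with this... You call the API to generate a similarity key, which strips out all the noise and accounts for known synonyms, soundex, ML, etc... You then match on the similarity keys rather than on the data itself to get much higher match rates (commercial API; disclaimer: I work for Interzoid).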

https://interzoid.com/services/getcompanymatchadvanced

score -1 · Accepted Answer

Use MatchKraft to fuzzy match company names across two lists.

http://www.matchkraft.com/

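Levenshtein distance is not enough to solve this problem. You also need the following:

Heuristics to improve execution time
Information retrieval (Lucene) and SQL
A database of company names

It is better to use an existing tool than to write a program in Python.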
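For anyone who still wants to experiment in Python, one common heuristic for the execution-time point is "blocking": only compare names that share a cheap key, instead of comparing every offline name against all 12 million rows. The sketch below is an assumption-laden illustration, not how MatchKraft or Lucene work; the key (first three characters), the 0.8 threshold, and the function names are all hypothetical choices.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def block_key(name):
    # Hypothetical blocking key: first three characters of the lowercased name.
    return name.lower().strip()[:3]

def build_index(names):
    """Group names by blocking key so lookups only touch one bucket."""
    index = defaultdict(list)
    for name in names:
        index[block_key(name)].append(name)
    return index

def best_match(query, index, threshold=0.8):
    """Return (best_candidate, score); candidate is None if below threshold."""
    best, best_score = None, 0.0
    for candidate in index.get(block_key(query), []):
        score = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best_score < threshold:
        return None, best_score
    return best, best_score

if __name__ == "__main__":
    index = build_index(["Water Incorporated", "G Corporation", "Wafer Industries"])
    print(best_match("Water Incorporate", index))
```

The trade-off is recall: a typo inside the first three characters puts the name in a different bucket, so in practice several keys (or an n-gram index such as Lucene provides) would be needed.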
