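I have a situation where I need to match names found in a given string against a database of names. Below is a very simple example of the problem I'm running into. I don't understand why one case works and the other doesn't. If I'm not mistaken, the default algorithm behind fuzzywuzzy's extractOne() is the Levenshtein distance algorithm. Is it because the Clemens names supply the first two letters of the first name, whereas the Gonzalez names supply only one?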
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

s = ['Gonzalez, E. walked down the street.', 'Gonzalez, R. went to the market.', 'Clemens, Ko. reach the intersection; Clemens, Ka. did not.']
names = []
for i in s:
    name = []  # clear name
    # collect the leading capitalized words, e.g. 'Gonzalez,' and 'E.'
    for k in i.split():
        if k[0].isupper(): name.append(k)
        else: break
    names.append(' '.join(name))
    # a sentence may hold further names after each ';'
    if ';' in i:
        for each in i.split(';')[1:]:
            name = []  # clear name
            for k in each.split():
                if k[0].isupper(): name.append(k)
                else: break
            names.append(' '.join(name))
print(names)

choices = ['Kody Clemens', 'Kacy Clemens', 'Gonzalez Ryan', 'Gonzalez Eddy']
for i in names:
    # best (choice, score) pair for this extracted name
    match = process.extractOne(i, choices)
    print(match, i)
Output:
['Gonzalez, E.', 'Gonzalez, R.', 'Clemens, Ko.', 'Clemens, Ka.']
('Gonzalez Ryan', 85) Gonzalez, E.
('Gonzalez Ryan', 85) Gonzalez, R.
('Kody Clemens', 86) Clemens, Ko.
('Kacy Clemens', 86) Clemens, Ka.
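
If the single initial really is what makes the Gonzalez rows ambiguous, one possible workaround is to skip fuzzy scoring for this step and compare the surname and the initial letters directly. The following is only a minimal standard-library sketch of that idea, not a description of how fuzzywuzzy scores anything. `parse_abbreviated` and `match_by_initial` are hypothetical helper names, and the sketch assumes every extracted name has the shape `Surname, Initials.` while the database may list the surname first or last.

```python
import re


def parse_abbreviated(name):
    """Split 'Gonzalez, E.' into ('gonzalez', 'e'); return None for any other shape."""
    m = re.fullmatch(r"\s*([A-Za-z'-]+),\s*([A-Za-z]+)\.?\s*", name)
    if m is None:
        return None
    return m.group(1).lower(), m.group(2).lower()


def match_by_initial(name, choices):
    """Return every choice that has the surname plus a token starting with the initial(s)."""
    parsed = parse_abbreviated(name)
    if parsed is None:
        return []
    surname, initial = parsed
    hits = []
    for choice in choices:
        tokens = choice.lower().split()
        if surname not in tokens:
            continue
        # Any token other than the surname may be the given name,
        # because the choices mix 'Given Surname' and 'Surname Given'.
        others = [t for t in tokens if t != surname]
        if any(t.startswith(initial) for t in others):
            hits.append(choice)
    return hits


choices = ['Kody Clemens', 'Kacy Clemens', 'Gonzalez Ryan', 'Gonzalez Eddy']
names = ['Gonzalez, E.', 'Gonzalez, R.', 'Clemens, Ko.', 'Clemens, Ka.']
for n in names:
    print(n, '->', match_by_initial(n, choices))
```

Returning a list instead of a single best match is a deliberate choice in this sketch: if two people in the database share a surname and an initial, both come back and the ambiguity stays visible rather than being settled by list order.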