python - Finding the closest approximate match between two sets of names

Question

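I have two sets of names and want to find the closest match between them. If no "close enough" match is found, I want to match the name to itself.

My current approach is to create a dataframe containing every possible combination, then iterate over it with .apply or a list comprehension and compute the similarity with SequenceMatcher (imported as sm).

The problem is that both lists contain several thousand names, which leads to an unmanageable runtime.

Ideally, my matching criteria are an sm ratio >= 0.85 and the first name appearing as a whole within the second name. If these conditions are not met, the name should be matched to itself.

The final step I want to implement is to replace the original series with these matched names.

Here is the code for my current approach. If anything is unclear, let me know how I can clarify: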

import pandas as pd
from difflib import SequenceMatcher as sm

stop_words = [
             'pharmaceuticals',
             'pharmaceutical',
             'pharma',
             'therapeutic',
             'biopharma',
             'therapeutics',
             'international',
             'biotechnology',
             'vaccines',
             '\\&',
             '&',
             'co.',
             'co',
             'biotherapeutics',
             'biotherapeutic',
             'life science',
             'life sciences',
             'laboratories',
             'biologics',
             ]

temp_db_companies = db['Company']

def approximate_match(s1, s2):
    return str(sm(None, str(s1).lower().strip(), str(s2).lower().strip()).ratio()) + '|' + str(s2).strip()


def replace_val_w_best_match(df, col_initial, series_final, stop_words):
    init_names = df[col_initial].str.lower().str.split(" ", expand=True).replace(stop_words, "").fillna('')

    init_names = pd.Series(['' for n in range(len(init_names))]).str.cat([init_names[col] for col in init_names.columns], sep= " ").str.replace('  ', ' ').reset_index()

    matching_df = pd.DataFrame(columns = list(init_names.columns) + list(series_final), data = init_names)

    matching_df = pd.melt(matching_df,
                          id_vars = ['index', 0],
                          value_vars = list(series_final),
                          var_name = 'Comparators',
                          value_name = 'Best match')

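#    matching =  matching_df.apply(lambda row: approximate_match(row[0], row['Comparators']), axis = 1)

    # Each row of matching_df is already one (name, comparator) pair,
    # so walk the two columns in lockstep instead of nesting loops again.
    names = matching_df[0].str.strip()
    comparators = matching_df['Comparators'].astype(str)

    # .ratio() yields the similarity as a float; without it the list
    # would hold SequenceMatcher objects rather than scores.
    ratio = [sm(None, name1, name2.lower().strip()).ratio()
             for name1, name2 in zip(names, comparators)]

    match = list(comparators)

    print(ratio[:5])
    print(match[:5])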

score 1 · Accepted Answer

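What you are probably looking for is the Levenshtein distance algorithm. It computes the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into another.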
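To make the idea concrete, here is a minimal pure-Python sketch of the classic dynamic-programming formulation. The function name `levenshtein` is only illustrative and is not the API of any library; a C-backed library will be far faster for thousands of names.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions to turn a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A distance of 0 means the strings are identical; dividing the distance by the length of the longer string gives a rough normalized score if a threshold is needed.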

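Take a look at this library: https://github.com/ztane/python-Levenshtein/

The Levenshtein library ships a module named StringMatcher.py, whose StringMatcher class is designed to help with this kind of problem.

This library offers similar functionality: https://github.com/gfairchild/pyxDamerauLevenshtein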
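If staying with the standard library is preferable, the matching rule from the question (ratio >= 0.85, the first name contained as a whole in the second, otherwise fall back to the name itself) could be sketched as below. This is an assumption-laden illustration, not a tested solution for the asker's data: `normalize` and `build_mapping` are hypothetical helper names, "as a whole" is interpreted as a whole-word substring, and stop-word removal is left out. `difflib.get_close_matches` still compares every pair, but it skips the expensive ratio when cheap upper bounds already fall below the cutoff, and it avoids building the full cross-product dataframe.

```python
from difflib import get_close_matches


def normalize(name):
    """Lower-case and collapse whitespace so comparisons ignore formatting."""
    return " ".join(str(name).lower().split())


def build_mapping(names, candidates, cutoff=0.85):
    """Map each name to its best qualifying candidate, or to itself."""
    by_key = {normalize(c): c for c in candidates}
    keys = list(by_key)
    mapping = {}
    for name in set(names):
        key = normalize(name)
        mapping[name] = name  # default: the name matches itself
        # Candidates come back sorted by similarity, best first,
        # and each one already scores at least `cutoff`.
        for cand_key in get_close_matches(key, keys, n=max(len(keys), 1), cutoff=cutoff):
            # Whole-word containment check (an assumed reading of "as a whole").
            if f" {key} " in f" {cand_key} ":
                mapping[name] = by_key[cand_key]
                break
    return mapping


mapping = build_mapping(["Acme Pharma", "Zeta Bio"], ["Acme Pharma Co", "Other Corp"])
print(mapping)
```

With pandas, the resulting dict could then replace the original column via something like `df[col_initial].map(mapping)`, since every input name has an entry.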
