python - High performance fuzzy string comparison in Python, use Levenshtein or difflib

Question

I am doing clinical message normalization (spell check) in which I check each given word against 900,000 word medical dictionary. I am more concern about the time complexity/performance.

I want to do fuzzy string comparison, but I'm not sure which library to use.

Option 1:

import Levenshtein
Levenshtein.ratio('hello world', 'hello')

Result: 0.625

Option 2:

import difflib
difflib.SequenceMatcher(None, 'hello world', 'hello').ratio()

Result: 0.625

In this example both give the same answer. Do you think both perform alike in this case?

score 198 · Accepted Answer

如果您对 Levenshtein 和 Difflib 相似性的快速视觉比较感兴趣，我计算了大约 230 万本书的标题：

import codecs, difflib, Levenshtein, distance

with codecs.open("titles.tsv","r","utf-8") as f:
    title_list = f.read().split("\n")[:-1]

    for row in title_list:

        sr      = row.lower().split("\t")

        diffl   = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
        lev     = Levenshtein.ratio(sr[3], sr[4]) 
        sor     = 1 - distance.sorensen(sr[3], sr[4])
        jac     = 1 - distance.jaccard(sr[3], sr[4])

        print diffl, lev, sor, jac

然后我用 R 绘制了结果：

在此处输入图像描述

出于好奇，我还比较了 Difflib、Levenshtein、Sørensen 和 Jaccard 的相似度值：

library(ggplot2)
require(GGally)

difflib <- read.table("similarity_measures.txt", sep = " ")
colnames(difflib) <- c("difflib", "levenshtein", "sorensen", "jaccard")

ggpairs(difflib)

结果：在此处输入图像描述

Difflib / Levenshtein 的相似性确实很有趣。

2018 年编辑：如果您正在努力识别相似的字符串，您还可以查看 minhashing——这里有一个很好的概述。Minhashing 擅长在线性时间内发现大型文本集合中的相似性。我的实验室在此处组装了一个应用程序，该应用程序使用 minhashing 检测和可视化文本重用：https ://github.com/YaleDHLab/intertext

score 127 · Accepted Answer

difflib.SequenceMatcher 使用Ratcliff/Obershelp算法计算匹配字符数的两倍除以两个字符串中的字符总数。
Levenshtein 使用Levenshtein 算法，它计算将一个字符串转换为另一个字符串所需的最小编辑次数

复杂

SequenceMatcher 是最坏情况的二次时间，其预期情况的行为以复杂的方式取决于序列有多少共同元素。（从这里）

Levenshtein 是 O(m*n)，其中 n 和 m 是两个输入字符串的长度。

表现

根据 Levenshtein 模块的源代码： Levenshtein 与 difflib (SequenceMatcher) 有一些重叠。它只支持字符串，不支持任意序列类型，但另一方面它要快得多。

python - High performance fuzzy string comparison in Python, use Levenshtein or difflib

2 回答 2

Related

Reference