根据这篇维基百科文章,我有以下计算 2 个字符串的 Damerau-Levenshtein 距离的 cython 实现,但目前它对我的需求来说太慢了。我有一个大约 600000 个字符串的列表,我必须在该列表中找到拼写错误。
如果有人能提出任何算法改进或一些可以减少脚本运行时间的 python/cython 魔法,我会很高兴。我并不真正关心它使用了多少空间,只关心计算所需的时间。
根据使用大约 2000 个字符串对脚本进行分析,它在damerauLevenshteinDistance
函数中花费了 80% 的完整运行时间(30 秒中的 24 秒),我完全不知道如何让它更快。
def damerauLevenshteinDistance(a, b, h):
"""
a = source sequence
b = comparing sequence
h = matrix to store the metrics (currently nested list)
"""
cdef int inf,lena,lenb,i,j,x,i1,j1,d,db
alphabet = getAlphabet((a,b))
lena = len(a)
lenb = len(b)
inf = lena + lenb + 1
da = [0 for x in xrange(0, len(alphabet))]
for i in xrange(1, lena+1):
db = 0
for j in xrange(1, lenb+1):
i1 = da[alphabet[b[j-1]]]
j1 = db
d = 1
if (a[i-1] == b[j-1]):
d = 0
db = j
h[i+1][j+1] = min(
h[i][j]+d,
h[i+1][j]+1,
h[i][j+1]+1,
h[i1][j1]+(i-i1-1)+1+(j-j1-1)
)
da[alphabet[a[i-1]]] = i
return h[lena+1][lenb+1]
cdef getAlphabet(words):
"""
construct an alphabet out of the lists found in the tuple words with a
sequential identifier for each word
"""
cdef int i
alphabet = {}
i = 0
for wordList in words:
for letter in wordList:
if letter not in alphabet:
alphabet[letter] = i
i += 1
return alphabet