我正在开发一个需要 Levenshtein 算法来计算两个字符串相似度的应用程序。
很久以前,我将一个 C# 版本(可以在互联网上很容易找到)改编为 VB.NET,它看起来像这样:
Public Function Levenshtein1(s1 As String, s2 As String) As Double
Dim n As Integer = s1.Length
Dim m As Integer = s2.Length
Dim d(n, m) As Integer
Dim cost As Integer
Dim s1c As Char
For i = 1 To n
d(i, 0) = i
Next
For j = 1 To m
d(0, j) = j
Next
For i = 1 To n
s1c = s1(i - 1)
For j = 1 To m
If s1c = s2(j - 1) Then
cost = 0
Else
cost = 1
End If
d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
Next
Next
Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
End Function
然后,试图调整它并提高它的性能,我以版本结束:
Public Function Levenshtein2(s1 As String, s2 As String) As Double
Dim n As Integer = s1.Length
Dim m As Integer = s2.Length
Dim d(n, m) As Integer
Dim s1c As Char
Dim cost As Integer
For i = 1 To n
d(i, 0) = i
s1c = s1(i - 1)
For j = 1 To m
d(0, j) = j
If s1c = s2(j - 1) Then
cost = 0
Else
cost = 1
End If
d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
Next
Next
Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
End Function
基本上,我认为距离数组 d(,) 可以在主循环内部初始化,而不是需要两个初始(和附加)循环。我真的认为这将是一个巨大的改进......不幸的是,不仅没有比原来的改进,它实际上运行得更慢!
我已经尝试通过查看生成的 IL 代码来分析这两个版本,但我就是无法理解。
所以,我希望有人能对这个问题有所了解并解释为什么第二个版本(即使它的循环次数更少)比原来的运行速度慢?
注意:时间差约为 0.15 纳秒。这看起来并不多,但是当您必须检查数以百万计的字符串时……差异变得非常显着。