vb.net - 同一算法的两种实现之间的性能差异

Question

我正在开发一个需要 Levenshtein 算法来计算两个字符串相似度的应用程序。

很久以前，我将一个 C# 版本（可以在互联网上很容易找到）改编为 VB.NET，它看起来像这样：

Public Function Levenshtein1(s1 As String, s2 As String) As Double
    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length

    Dim d(n, m) As Integer
    Dim cost As Integer
    Dim s1c As Char

    For i = 1 To n
        d(i, 0) = i
    Next
    For j = 1 To m
        d(0, j) = j
    Next

    For i = 1 To n
        s1c = s1(i - 1)

        For j = 1 To m
            If s1c = s2(j - 1) Then
                cost = 0
            Else
                cost = 1
            End If

            d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
        Next
    Next

    Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
End Function

然后，试图调整它并提高它的性能，我以版本结束：

Public Function Levenshtein2(s1 As String, s2 As String) As Double
    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length

    Dim d(n, m) As Integer
    Dim s1c As Char
    Dim cost As Integer

    For i = 1 To n
        d(i, 0) = i
        s1c = s1(i - 1)

        For j = 1 To m
            d(0, j) = j

            If s1c = s2(j - 1) Then
                cost = 0
            Else
                cost = 1
            End If

            d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
        Next
    Next

    Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
End Function

基本上，我认为距离数组 d(,) 可以在主循环内部初始化，而不是需要两个初始（和附加）循环。我真的认为这将是一个巨大的改进......不幸的是，不仅没有比原来的改进，它实际上运行得更慢！

我已经尝试通过查看生成的 IL 代码来分析这两个版本，但我就是无法理解。

所以，我希望有人能对这个问题有所了解并解释为什么第二个版本（即使它的循环次数更少）比原来的运行速度慢？

注意：时间差约为 0.15 纳秒。这看起来并不多，但是当您必须检查数以百万计的字符串时……差异变得非常显着。

score 2 · Accepted Answer

正因为如此：

 For i = 1 To n
        d(i, 0) = i
        s1c = s1(i - 1)

        For j = 1 To m
            d(0, j) = j 'THIS LINE HERE

你刚开始初始化这个数组，但现在你初始化它n次。像这样访问数组中的内存是有代价的，现在你要多做n次。您可以将行更改为：If i = 1 Then d(0, j) = j。但是，在我的测试中，您基本上仍然会得到比原始版本稍慢的版本。这又是有道理的。您正在执行此 if 语句 n*m 次。再次有一些成本。像原始版本一样将其移出要便宜得多它最终是 O(n)。由于整体算法是 O(n*m)，因此您可以移出 O(n) 步骤的任何步骤都将是一场胜利。

score 2 · Accepted Answer

您可以拆分以下行：

d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)

如下：

tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
d(i, j) = Math.Min(tmp, d(i - 1, j - 1) + cost)

这样你就避免了一次求和

此外，您可以将最后一个“min”比较放在 if 部分中并避免分配成本：

tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
If s1c = s2(j - 1) Then
   d(i, j) = Math.Min(tmp, d(i - 1, j - 1))
Else
   d(i, j) = Math.Min(tmp, d(i - 1, j - 1)+1)
End If

所以当 s1c = s2(j - 1)

score 0 · Accepted Answer

不是您问题的直接答案，但为了获得更快的性能，您应该考虑使用锯齿状数组（数组数组）而不是多维数组。多维数组和 C# 中的数组数组有什么区别？为什么 .NET中的多维数组比普通数组慢？

您将看到锯齿状数组的代码大小为 7，而多维数组的代码大小为 10。

下面的代码是使用一个交错数组，一维数组

Public Function Levenshtein3(s1 As String, s2 As String) As Double
    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length

    Dim d()() As Integer = New Integer(n)() {}
    Dim cost As Integer
    Dim s1c As Char

    For i = 0 To n
        d(i) = New Integer(m) {}
    Next

    For j = 1 To m
        d(0)(j) = j
    Next

    For i = 1 To n
        d(i)(0) = i
        s1c = s1(i - 1)

        For j = 1 To m
            If s1c = s2(j - 1) Then
                cost = 0
            Else
                cost = 1
            End If

            d(i)(j) = Math.Min(Math.Min(d(i - 1)(j) + 1, d(i)(j - 1) + 1), d(i - 1)(j - 1) + cost)
        Next
    Next

    Return (1.0 - (d(n)(m) / Math.Max(n, m))) * 100
End Function

vb.net - 同一算法的两种实现之间的性能差异

3 回答 3

Related

Reference