25

我刚刚实现了一个最佳匹配文件搜索算法来查找与字典中的字符串最接近的匹配。在分析我的代码后,我发现绝大多数时间都花在了计算查询与可能结果之间的距离上。我目前正在实现使用二维数组计算 Levenshtein 距离的算法,这使得实现成为 O(n^2) 操作。我希望有人可以提出一种更快的方法来做同样的事情。

这是我的实现:

public int calculate(String root, String query)
{
  int arr[][] = new int[root.length() + 2][query.length() + 2];

  for (int i = 2; i < root.length() + 2; i++)
  {
    arr[i][0] = (int) root.charAt(i - 2);
    arr[i][1] = (i - 1);
  }

  for (int i = 2; i < query.length() + 2; i++)
  {
    arr[0][i] = (int) query.charAt(i - 2);
    arr[1][i] = (i - 1);
  }

  for (int i = 2; i < root.length() + 2; i++)
  {
    for (int j = 2; j < query.length() + 2; j++)
    {
      int diff = 0;
      if (arr[0][j] != arr[i][0])
      {
        diff = 1;
      }
      arr[i][j] = min((arr[i - 1][j] + 1), (arr[i][j - 1] + 1), (arr[i - 1][j - 1] + diff));
    }
  }
  return arr[root.length() + 1][query.length() + 1];
}

public int min(int n1, int n2, int n3)
{
  return (int) Math.min(n1, Math.min(n2, n3));
}
4

6 回答 6

26

Levenshtein distance 上的维基百科条目为优化计算提供了有用的建议——在您的情况下,最适用的建议是,如果您可以限制k最大感兴趣距离(超出此范围的任何距离都可能是无穷大!),您可以减少计算O(n times k)而不是O(n squared)(基本上是在最小可能距离变为 时立即放弃> k)。

由于您正在寻找最接近的匹配,您可以逐渐减少k到迄今为止找到的最佳匹配的距离 - 这不会影响最坏情况的行为(因为匹配可能按距离递减顺序,这意味着您永远不会尽快救助)但平均情况应该会有所改善。

我相信,如果您需要获得更好性能,您可能必须接受一些强有力的妥协,以计算更近似的距离(因此得到“相当好的匹配”而不是最佳匹配)。

于 2010-07-06T02:46:38.910 回答
7

根据这篇博客的评论Speeding Up Levenshtein,您可以使用 VP-Trees 并实现 O(nlogn)。同一博客上的另一条评论指向VP-Trees 和 Levenshtein 的 python 实现。请让我们知道这是否有效。

于 2010-07-06T02:53:12.447 回答
3

Wikipedia 文章讨论了您的算法以及各种改进。但是,至少在一般情况下,O(n^2)似乎是你能得到的最好的。

但是,如果您可以限制您的问题,则可以进行一些改进(例如,如果您只对小于d的距离感兴趣,则复杂性为O(dn) - 这可能是因为距离接近字符串长度的匹配是有意义的可能不是很有趣)。看看你是否可以利用你的问题的细节......

于 2010-07-06T02:47:59.440 回答
3

我修改了这篇文章中的 Levenshtein 距离 VBA 函数以使用一维数组。它执行得更快。

'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)

Public Function LevenshteinDistance2(ByRef s1 As String, ByRef s2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long, LD As Long 'Length of input strings and distance matrix
Dim i As Long, j As Long, ss2 As Long, ssL As Long, cost As Long 'loop counters, loop step, loop start, and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and Substitution
Dim L1p1 As Long, L1p2 As Long 'Length of S1 + 1, Length of S1 + 2

L1 = Len(s1): L2 = Len(s2)
L1p1 = L1 + 1
L1p2 = L1 + 2
LD = (((L1 + 1) * (L2 + 1))) - 1
ReDim D(0 To LD)
ss2 = L1 + 1

For i = 0 To L1 Step 1: D(i) = i: Next i                'setup array positions 0,1,2,3,4,...
For j = 0 To LD Step ss2: D(j) = j / ss2: Next j        'setup array positions 0,1,2,3,4,...

For j = 1 To L2
    ssL = (L1 + 1) * j
    For i = (ssL + 1) To (ssL + L1)
        If Mid$(s1, i Mod ssL, 1) <> Mid$(s2, j, 1) Then cost = 1 Else cost = 0
        cI = D(i - 1) + 1
        cD = D(i - L1p1) + 1
        cS = D(i - L1p2) + cost

        If cI <= cD Then 'Insertion or Substitution
            If cI <= cS Then D(i) = cI Else D(i) = cS
        Else 'Deletion or Substitution
            If cD <= cS Then D(i) = cD Else D(i) = cS
        End If
    Next i
Next j

LevenshteinDistance2 = D(LD)
End Function

我已经用长度为 11,304 的字符串 's1' 和长度为 5,665 的字符串 's2' 测试了这个函数(> 6400 万个字符比较)。使用上述一维版本的函数,在我的机器上执行时间约为 24 秒。我在上面的链接中引用的原始二维函数对于相同的字符串需要大约 37 秒。我已经进一步优化了一维函数,如下所示,相同的字符串需要大约 10 秒。

'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)
Public Function LevenshteinDistance(ByRef s1 As String, ByRef s2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long, LD As Long         'Length of input strings and distance matrix
Dim i As Long, j As Long, ss2 As Long                       'loop counters, loop step
Dim ssL As Long, cost As Long                               'loop start, and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long                      'cost of next Insertion, Deletion and Substitution
Dim L1p1 As Long, L1p2 As Long                              'Length of S1 + 1, Length of S1 + 2
Dim sss1() As String, sss2() As String                      'Character arrays for string S1 & S2

L1 = Len(s1): L2 = Len(s2)
L1p1 = L1 + 1
L1p2 = L1 + 2
LD = (((L1 + 1) * (L2 + 1))) - 1
ReDim D(0 To LD)
ss2 = L1 + 1

For i = 0 To L1 Step 1: D(i) = i: Next i                    'setup array positions 0,1,2,3,4,...
For j = 0 To LD Step ss2: D(j) = j / ss2: Next j            'setup array positions 0,1,2,3,4,...

ReDim sss1(1 To L1)                                         'Size character array S1
ReDim sss2(1 To L2)                                         'Size character array S2
For i = 1 To L1 Step 1: sss1(i) = Mid$(s1, i, 1): Next i    'Fill S1 character array
For i = 1 To L2 Step 1: sss2(i) = Mid$(s2, i, 1): Next i    'Fill S2 character array

For j = 1 To L2
    ssL = (L1 + 1) * j
    For i = (ssL + 1) To (ssL + L1)
        If sss1(i Mod ssL) <> sss2(j) Then cost = 1 Else cost = 0
        cI = D(i - 1) + 1
        cD = D(i - L1p1) + 1
        cS = D(i - L1p2) + cost
        If cI <= cD Then 'Insertion or Substitution
            If cI <= cS Then D(i) = cI Else D(i) = cS
        Else 'Deletion or Substitution
            If cD <= cS Then D(i) = cD Else D(i) = cS
        End If
    Next i
Next j

LevenshteinDistance = D(LD)
End Function
于 2015-04-09T16:55:37.523 回答
2

Commons-lang 有一个相当快的实现。请参阅http://web.archive.org/web/20120526085419/http://www.merriampark.com/ldjava.htm

这是我对 Scala 的翻译:

// The code below is based on code from the Apache Commons lang project.
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with this
 * work for additional information regarding copyright ownership. The ASF
 * licenses this file to You under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance with the
 * License. You may obtain a copy of the License at
 * 
 * http://www.apache.org/licenses/LICENSE-2.0
 * 
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations
 * under the License.
 */
/**
* assert(levenshtein("algorithm", "altruistic")==6)
* assert(levenshtein("1638452297", "444488444")==9)
* assert(levenshtein("", "") == 0)
* assert(levenshtein("", "a") == 1)
* assert(levenshtein("aaapppp", "") == 7)
* assert(levenshtein("frog", "fog") == 1)
* assert(levenshtein("fly", "ant") == 3)
* assert(levenshtein("elephant", "hippo") == 7)
* assert(levenshtein("hippo", "elephant") == 7)
* assert(levenshtein("hippo", "zzzzzzzz") == 8)
* assert(levenshtein("hello", "hallo") == 1)
*
*/
def levenshtein(s: CharSequence, t: CharSequence, max: Int = Int.MaxValue) = {
import scala.annotation.tailrec
def impl(s: CharSequence, t: CharSequence, n: Int, m: Int) = {
  // Inside impl n <= m!
  val p = new Array[Int](n + 1) // 'previous' cost array, horizontally
  val d = new Array[Int](n + 1) // cost array, horizontally

  @tailrec def fillP(i: Int) {
    p(i) = i
    if (i < n) fillP(i + 1)
  }
  fillP(0)

  @tailrec def eachJ(j: Int, t_j: Char, d: Array[Int], p: Array[Int]): Int = {
    d(0) = j
    @tailrec def eachI(i: Int) {
      val a = d(i - 1) + 1
      val b = p(i) + 1
      d(i) = if (a < b) a else {
        val c = if (s.charAt(i - 1) == t_j) p(i - 1) else p(i - 1) + 1
        if (b < c) b else c
      }
      if (i < n)
        eachI(i + 1)
    }
    eachI(1)

    if (j < m)
      eachJ(j + 1, t.charAt(j), p, d)
    else
      d(n)
  }
  eachJ(1, t.charAt(0), d, p)
}

val n = s.length
val m = t.length
if (n == 0) m else if (m == 0) n else {
  if (n > m) impl(t, s, m, n) else impl(s, t, n, m)
}

}

于 2011-08-24T06:57:43.687 回答
2

我知道这已经很晚了,但它与手头的讨论有关。

正如其他人所提到的,如果您只想检查两个字符串之间的编辑距离是否在某个阈值 k 内,则可以将时间复杂度降低到O(kn)。更精确的表达式是O((2k+1)n)。您取一条横跨对角单元两侧的 k 个单元的条带(条带的长度 2k+1)并计算位于该条带上的单元格的值。

有趣的是,李等人做出了改进。人。并且这已进一步简化为 O((k+1)n)。

于 2012-12-30T19:54:39.460 回答