optimization - Optimizing Levenshtein distance algorithm

Question

I have a stored procedure that uses Levenshtein distance to determine the result closest to what the user typed. The only thing really affecting the speed is the function that calculates the Levenshtein distance for all the records before selecting the record with the lowest distance (I've verified this by putting a 0 in place of the call to the Levenshtein function). The table has 1.5 million records, so even the slightest adjustment may shave off a few seconds. Right now the entire thing runs over 10 minutes. Here's the method I'm using:

ALTER function dbo.Levenshtein
( 
    @Source nvarchar(200), 
    @Target nvarchar(200) 
) 
RETURNS int
AS
BEGIN
DECLARE @Source_len int, @Target_len int, @i int, @j int, @Source_char nchar, @Dist int, @Dist_temp int, @Distv0 varbinary(8000), @Distv1 varbinary(8000)

SELECT @Source_len = LEN(@Source), @Target_len = LEN(@Target), @Distv1 = 0x0000, @j = 1, @i = 1, @Dist = 0

WHILE @j <= @Target_len
BEGIN
    SELECT @Distv1 = @Distv1 + CAST(@j AS binary(2)), @j = @j + 1
END

WHILE @i <= @Source_len
BEGIN
    SELECT @Source_char = SUBSTRING(@Source, @i, 1), @Dist = @i, @Distv0 = CAST(@i AS binary(2)), @j = 1

WHILE @j <= @Target_len
BEGIN
    SET @Dist = @Dist + 1
    SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j-1, 2) AS int) +
                  CASE WHEN @Source_char = SUBSTRING(@Target, @j, 1) THEN 0 ELSE 1 END

    IF @Dist > @Dist_temp
    BEGIN
        SET @Dist = @Dist_temp
    END

    SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j+1, 2) AS int)+1

    IF @Dist > @Dist_temp SET @Dist = @Dist_temp
    BEGIN
        SELECT @Distv0 = @Distv0 + CAST(@Dist AS binary(2)), @j = @j + 1
    END
END

SELECT @Distv1 = @Distv0, @i = @i + 1
END

RETURN @Dist
END

Where should I go from here?

score 6 · Accepted Answer

我过去这样做的方法是将“数据库”（实际上是拼写纠正器的单词字典）存储为 trie。

然后我使用分支定界例程来查找最近的匹配条目。对于小距离，所花费的时间是距离的指数。对于大距离，它在字典的大小上是线性的，就像你现在看到的那样。

分支定界基本上是树的深度优先树遍历，但有一个错误预算。在每个节点上，您跟踪当前的 levenshtein 距离，如果它超出预算，则修剪树的那个分支。

首先，您以零预算进行步行。那只会找到完全匹配的。如果您没有找到匹配项，那么您可以预算为一个。这将在距离为 1 处找到匹配项。如果您没有找到任何匹配项，则使用预算 2 进行匹配，依此类推。这听起来效率低下，但由于每次步行都比前一次花费更多的时间，所以时间主要由你最后一次步行决定。

补充：代码大纲（请原谅我的C）：

// dumb version of trie node, indexed by letter. You can improve.
typedef struct tnodeTag {
  tnodeTag* p[128];
} tnode;

tnode* top; // the top of the trie

void walk(tnode* p, char* s, int budget){
  int i;
  if (*s == 0){
    if (p == NULL){
      // print the current trie path
    }
  }
  else if (budget >= 0){
    // try deleting this letter
    walk(p, s+1, budget-1);
    // try swapping two adjacent letters
    if (s[1]){
      swap(s[0], s[1]);
      walk(p, s, budget-1);
      swap(s[0], s[1]);
    }
    if (p){
      for (i = 0; i < 128; i++){
        // try exact match
        if (i == *s) walk(p->p[i], s+1, budget);
        // try replacing this character
        if (i != *s) walk(p->p[i], s+1, budget-1);
        // try inserting this letter
        walk(p->p[i], s, budget-1);
      }
    }
  }
}

基本上，您通过跳过它并在同一个节点上搜索来模拟删除一个字母。您可以通过在不推进 s 的情况下降低 trie 来模拟插入一个字母。你模拟替换一个字母，就像字母匹配一样，即使它不匹配。当你掌握了它的窍门后，你可以添加其他可能的不匹配，比如用 O 替换 0 和用 L 或 I 替换 1 - 像这样的愚蠢的东西。

您可能想要添加一个字符数组参数来表示您在 trie 中找到的当前单词。

optimization - Optimizing Levenshtein distance algorithm

1 回答 1

Related

Reference