c# - 如何计算给定2个字符串的距离相似性度量？

Question

我需要计算 2 个字符串之间的相似度。那我到底是什么意思？让我用一个例子来解释：

真话：hospital
错字：haspita

现在我的目标是确定我需要多少个字符来修改错误的单词以获得真实的单词。在这个例子中，我需要修改 2 个字母。那么百分比是多少呢？我总是取真实单词的长度。所以它变成 2 / 8 = 25% 所以这 2 个给定的字符串 DSM 是 75%。

我怎样才能在性能成为关键考虑因素的情况下实现这一目标？

score 75 · Accepted Answer

几周前我刚刚解决了这个完全相同的问题。既然有人现在问，我会分享代码。在我详尽的测试中，即使没有提供最大距离，我的代码也比 Wikipedia 上的 C# 示例快 10 倍左右。当提供最大距离时，此性能增益增加到 30x - 100x +。请注意性能的几个关键点：

如果您需要一遍又一遍地比较相同的单词，首先将单词转换为整数数组。Damerau-Levenshtein 算法包括许多 >、<、== 比较，并且ints比较速度比chars.
它包括一个短路机制，如果距离超过提供的最大值，则退出
使用一组旋转的三个数组，而不是像我在其他地方看到的所有实现那样的大矩阵
确保您的数组切片跨越较短的字宽。

代码（如果在参数声明中替换int[]为，它的工作原理完全相同：String

/// <summary>
/// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of
/// integers, where each integer represents the code point of a character in the source string.
/// Includes an optional threshhold which can be used to indicate the maximum allowable distance.
/// </summary>
/// <param name="source">An array of the code points of the first string</param>
/// <param name="target">An array of the code points of the second string</param>
/// <param name="threshold">Maximum allowable distance</param>
/// <returns>Int.MaxValue if threshhold exceeded; otherwise the Damerau-Leveshteim distance between the strings</returns>
public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {

    int length1 = source.Length;
    int length2 = target.Length;

    // Return trivial case - difference in string lengths exceeds threshhold
    if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }

    // Ensure arrays [i] / length1 use shorter length 
    if (length1 > length2) {
        Swap(ref target, ref source);
        Swap(ref length1, ref length2);
    }

    int maxi = length1;
    int maxj = length2;

    int[] dCurrent = new int[maxi + 1];
    int[] dMinus1 = new int[maxi + 1];
    int[] dMinus2 = new int[maxi + 1];
    int[] dSwap;

    for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }

    int jm1 = 0, im1 = 0, im2 = -1;

    for (int j = 1; j <= maxj; j++) {

        // Rotate
        dSwap = dMinus2;
        dMinus2 = dMinus1;
        dMinus1 = dCurrent;
        dCurrent = dSwap;

        // Initialize
        int minDistance = int.MaxValue;
        dCurrent[0] = j;
        im1 = 0;
        im2 = -1;

        for (int i = 1; i <= maxi; i++) {

            int cost = source[im1] == target[jm1] ? 0 : 1;

            int del = dCurrent[im1] + 1;
            int ins = dMinus1[i] + 1;
            int sub = dMinus1[im1] + cost;

            //Fastest execution for min value of 3 integers
            int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);

            if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])
                min = Math.Min(min, dMinus2[im2] + cost);

            dCurrent[i] = min;
            if (min < minDistance) { minDistance = min; }
            im1++;
            im2++;
        }
        jm1++;
        if (minDistance > threshold) { return int.MaxValue; }
    }

    int result = dCurrent[maxi];
    return (result > threshold) ? int.MaxValue : result;
}

在哪里Swap：

static void Swap<T>(ref T arg1,ref T arg2) {
    T temp = arg1;
    arg1 = arg2;
    arg2 = temp;
}

score 52 · Accepted Answer

您要查找的内容称为编辑距离或Levenshtein 距离。维基百科文章解释了它是如何计算的，并在底部有一段很好的伪代码，可以帮助您非常轻松地用 C# 编写这个算法。

这是下面链接的第一个站点的实现：

private static int  CalcLevenshteinDistance(string a, string b)
    {
    if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
        return 0;
    }
    if (String.IsNullOrEmpty(a)) {
        return b.Length;
    }
    if (String.IsNullOrEmpty(b)) {
        return a.Length;
    }
    int  lengthA   = a.Length;
    int  lengthB   = b.Length;
    var  distances = new int[lengthA + 1, lengthB + 1];
    for (int i = 0;  i <= lengthA;  distances[i, 0] = i++);
    for (int j = 0;  j <= lengthB;  distances[0, j] = j++);

    for (int i = 1;  i <= lengthA;  i++)
        for (int j = 1;  j <= lengthB;  j++)
            {
            int  cost = b[j - 1] == a[i - 1] ? 0 : 1;
            distances[i, j] = Math.Min
                (
                Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
                distances[i - 1, j - 1] + cost
                );
            }
    return distances[lengthA, lengthB];
    }

score 37 · Accepted Answer

有大量的字符串相似距离算法可以使用。此处列出的一些（但未详尽列出）：

列文斯坦
针线工拳头
史密斯沃特曼
史密斯沃特曼后藤
雅罗，雅罗·温克勒
杰卡德相似度
欧几里得距离
骰子相似度
余弦相似度
蒙格埃尔坎

包含所有这些实现的库称为SimMetrics ，它同时具有 java 和 c# 实现。

score 24 · Accepted Answer

我发现Levenshtein和Jaro Winkler非常适合字符串之间的细微差异，例如：

拼写错误; 或者
ö代替人名中的o 。

然而，当比较文章标题之类的内容时，其中大部分文本相同但边缘有“噪音”，Smith-Waterman-Gotoh非常棒：

比较这两个标题（相同但不同来源的措辞不同）：

一种来自大肠杆菌的核酸内切酶，可在紫外线照射的 DNA 中引入单多核苷酸链断裂

核酸内切酶 III：一种来自大肠杆菌的核酸内切酶，可在紫外线照射的 DNA 中引入单多核苷酸链断裂

这个提供字符串算法比较的网站显示：

莱文斯坦：81
史密斯-沃特曼后藤 94
雅罗·温克勒 78

Jaro Winkler 和 Levenshtein 在检测相似性方面的能力不如 Smith Waterman Gotoh。如果我们比较两个不同的标题，但有一些匹配的文本：

高等植物的脂肪代谢。酰基硫酯酶在酰基辅酶A和酰基载体蛋白s代谢中的作用

高等植物的脂肪代谢。复杂脂质混合物中酰基-酰基载体蛋白和酰基辅酶A的测定

Jaro Winkler 给出了误报，但 Smith Waterman Gotoh 没有：

莱文斯坦：54
史密斯-沃特曼后藤 49
雅罗·温克勒 89

正如Anastasiosyal指出的那样，SimMetrics拥有这些算法的 java 代码。我使用 SimMetrics 中的SmithWatermanGotoh java 代码取得了成功。

score 7 · Accepted Answer

这是我对 Damerau Levenshtein Distance 的实现，它不仅返回相似系数，还返回更正单词中的错误位置（此功能可在文本编辑器中使用）。我的实现也支持不同权重的错误（替换、删除、插入、转置）。

public static List<Mistake> OptimalStringAlignmentDistance(
  string word, string correctedWord,
  bool transposition = true,
  int substitutionCost = 1,
  int insertionCost = 1,
  int deletionCost = 1,
  int transpositionCost = 1)
{
    int w_length = word.Length;
    int cw_length = correctedWord.Length;
    var d = new KeyValuePair<int, CharMistakeType>[w_length + 1, cw_length + 1];
    var result = new List<Mistake>(Math.Max(w_length, cw_length));

    if (w_length == 0)
    {
        for (int i = 0; i < cw_length; i++)
            result.Add(new Mistake(i, CharMistakeType.Insertion));
        return result;
    }

    for (int i = 0; i <= w_length; i++)
        d[i, 0] = new KeyValuePair<int, CharMistakeType>(i, CharMistakeType.None);

    for (int j = 0; j <= cw_length; j++)
        d[0, j] = new KeyValuePair<int, CharMistakeType>(j, CharMistakeType.None);

    for (int i = 1; i <= w_length; i++)
    {
        for (int j = 1; j <= cw_length; j++)
        {
            bool equal = correctedWord[j - 1] == word[i - 1];
            int delCost = d[i - 1, j].Key + deletionCost;
            int insCost = d[i, j - 1].Key + insertionCost;
            int subCost = d[i - 1, j - 1].Key;
            if (!equal)
                subCost += substitutionCost;
            int transCost = int.MaxValue;
            if (transposition && i > 1 && j > 1 && word[i - 1] == correctedWord[j - 2] && word[i - 2] == correctedWord[j - 1])
            {
                transCost = d[i - 2, j - 2].Key;
                if (!equal)
                    transCost += transpositionCost;
            }

            int min = delCost;
            CharMistakeType mistakeType = CharMistakeType.Deletion;
            if (insCost < min)
            {
                min = insCost;
                mistakeType = CharMistakeType.Insertion;
            }
            if (subCost < min)
            {
                min = subCost;
                mistakeType = equal ? CharMistakeType.None : CharMistakeType.Substitution;
            }
            if (transCost < min)
            {
                min = transCost;
                mistakeType = CharMistakeType.Transposition;
            }

            d[i, j] = new KeyValuePair<int, CharMistakeType>(min, mistakeType);
        }
    }

    int w_ind = w_length;
    int cw_ind = cw_length;
    while (w_ind >= 0 && cw_ind >= 0)
    {
        switch (d[w_ind, cw_ind].Value)
        {
            case CharMistakeType.None:
                w_ind--;
                cw_ind--;
                break;
            case CharMistakeType.Substitution:
                result.Add(new Mistake(cw_ind - 1, CharMistakeType.Substitution));
                w_ind--;
                cw_ind--;
                break;
            case CharMistakeType.Deletion:
                result.Add(new Mistake(cw_ind, CharMistakeType.Deletion));
                w_ind--;
                break;
            case CharMistakeType.Insertion:
                result.Add(new Mistake(cw_ind - 1, CharMistakeType.Insertion));
                cw_ind--;
                break;
            case CharMistakeType.Transposition:
                result.Add(new Mistake(cw_ind - 2, CharMistakeType.Transposition));
                w_ind -= 2;
                cw_ind -= 2;
                break;
        }
    }
    if (d[w_length, cw_length].Key > result.Count)
    {
        int delMistakesCount = d[w_length, cw_length].Key - result.Count;
        for (int i = 0; i < delMistakesCount; i++)
            result.Add(new Mistake(0, CharMistakeType.Deletion));
    }

    result.Reverse();

    return result;
}

public struct Mistake
{
    public int Position;
    public CharMistakeType Type;

    public Mistake(int position, CharMistakeType type)
    {
        Position = position;
        Type = type;
    }

    public override string ToString()
    {
        return Position + ", " + Type;
    }
}

public enum CharMistakeType
{
    None,
    Substitution,
    Insertion,
    Deletion,
    Transposition
}

此代码是我的项目的一部分：Yandex-Linguistics.NET。

我写了一些测试，在我看来该方法有效。

但欢迎评论和评论。

score 4 · Accepted Answer

这是另一种方法：

寻找相似性的典型方法是 Levenshtein 距离，毫无疑问，有一个带有代码的库。

不幸的是，这需要比较每个字符串。如果距离大于某个阈值，您也许可以编写专门版本的代码来缩短计算，您仍然需要进行所有比较。

另一个想法是使用 trigrams 或 n-grams 的一些变体。这些是 n 个字符的序列（或 n 个单词或 n 个基因组序列或 n 个）。保留三元组到字符串的映射，并选择重叠最大的那些。n 的典型选择是“3”，因此得名。

例如，英语会有这些三元组：

英文
ngl
格力
利斯
是的

英格兰将拥有：

英文
ngl
格拉
局域网
和

好吧，7 场比赛中有 2 场（或 10 场比赛中有 4 场）匹配。如果这对您有用，您可以索引 trigram/string 表并获得更快的搜索。

您还可以将其与 Levenshtein 结合使用，以减少与具有一些最小数量的共同 n-gram 的比较集。

score 0 · Accepted Answer

这是一个 VB.net 实现：

Public Shared Function LevenshteinDistance(ByVal v1 As String, ByVal v2 As String) As Integer
    Dim cost(v1.Length, v2.Length) As Integer
    If v1.Length = 0 Then
        Return v2.Length                'if string 1 is empty, the number of edits will be the insertion of all characters in string 2
    ElseIf v2.Length = 0 Then
        Return v1.Length                'if string 2 is empty, the number of edits will be the insertion of all characters in string 1
    Else
        'setup the base costs for inserting the correct characters
        For v1Count As Integer = 0 To v1.Length
            cost(v1Count, 0) = v1Count
        Next v1Count
        For v2Count As Integer = 0 To v2.Length
            cost(0, v2Count) = v2Count
        Next v2Count
        'now work out the cheapest route to having the correct characters
        For v1Count As Integer = 1 To v1.Length
            For v2Count As Integer = 1 To v2.Length
                'the first min term is the cost of editing the character in place (which will be the cost-to-date or the cost-to-date + 1 (depending on whether a change is required)
                'the second min term is the cost of inserting the correct character into string 1 (cost-to-date + 1), 
                'the third min term is the cost of inserting the correct character into string 2 (cost-to-date + 1) and 
                cost(v1Count, v2Count) = Math.Min(
                    cost(v1Count - 1, v2Count - 1) + If(v1.Chars(v1Count - 1) = v2.Chars(v2Count - 1), 0, 1),
                    Math.Min(
                        cost(v1Count - 1, v2Count) + 1,
                        cost(v1Count, v2Count - 1) + 1
                    )
                )
            Next v2Count
        Next v1Count

        'the final result is the cheapest cost to get the two strings to match, which is the bottom right cell in the matrix
        'in the event of strings being equal, this will be the result of zipping diagonally down the matrix (which will be square as the strings are the same length)
        Return cost(v1.Length, v2.Length)
    End If
End Function

c# - 如何计算给定2个字符串的距离相似性度量？

7 回答 7

Related

Reference