我需要计算 2 个字符串之间的相似度。那我到底是什么意思?让我用一个例子来解释:
- 真话:
hospital
- 错字:
haspita
现在我的目标是确定我需要多少个字符来修改错误的单词以获得真实的单词。在这个例子中,我需要修改 2 个字母。那么百分比是多少呢?我总是取真实单词的长度。所以它变成 2 / 8 = 25% 所以这 2 个给定的字符串 DSM 是 75%。
我怎样才能在性能成为关键考虑因素的情况下实现这一目标?
我需要计算 2 个字符串之间的相似度。那我到底是什么意思?让我用一个例子来解释:
hospital
haspita
现在我的目标是确定我需要多少个字符来修改错误的单词以获得真实的单词。在这个例子中,我需要修改 2 个字母。那么百分比是多少呢?我总是取真实单词的长度。所以它变成 2 / 8 = 25% 所以这 2 个给定的字符串 DSM 是 75%。
我怎样才能在性能成为关键考虑因素的情况下实现这一目标?
几周前我刚刚解决了这个完全相同的问题。既然有人现在问,我会分享代码。在我详尽的测试中,即使没有提供最大距离,我的代码也比 Wikipedia 上的 C# 示例快 10 倍左右。当提供最大距离时,此性能增益增加到 30x - 100x +。请注意性能的几个关键点:
ints
比较速度比chars
.代码(如果在参数声明中替换int[]
为,它的工作原理完全相同:String
/// <summary>
/// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of
/// integers, where each integer represents the code point of a character in the source string.
/// Includes an optional threshhold which can be used to indicate the maximum allowable distance.
/// </summary>
/// <param name="source">An array of the code points of the first string</param>
/// <param name="target">An array of the code points of the second string</param>
/// <param name="threshold">Maximum allowable distance</param>
/// <returns>Int.MaxValue if threshhold exceeded; otherwise the Damerau-Leveshteim distance between the strings</returns>
public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {
int length1 = source.Length;
int length2 = target.Length;
// Return trivial case - difference in string lengths exceeds threshhold
if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }
// Ensure arrays [i] / length1 use shorter length
if (length1 > length2) {
Swap(ref target, ref source);
Swap(ref length1, ref length2);
}
int maxi = length1;
int maxj = length2;
int[] dCurrent = new int[maxi + 1];
int[] dMinus1 = new int[maxi + 1];
int[] dMinus2 = new int[maxi + 1];
int[] dSwap;
for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }
int jm1 = 0, im1 = 0, im2 = -1;
for (int j = 1; j <= maxj; j++) {
// Rotate
dSwap = dMinus2;
dMinus2 = dMinus1;
dMinus1 = dCurrent;
dCurrent = dSwap;
// Initialize
int minDistance = int.MaxValue;
dCurrent[0] = j;
im1 = 0;
im2 = -1;
for (int i = 1; i <= maxi; i++) {
int cost = source[im1] == target[jm1] ? 0 : 1;
int del = dCurrent[im1] + 1;
int ins = dMinus1[i] + 1;
int sub = dMinus1[im1] + cost;
//Fastest execution for min value of 3 integers
int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);
if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])
min = Math.Min(min, dMinus2[im2] + cost);
dCurrent[i] = min;
if (min < minDistance) { minDistance = min; }
im1++;
im2++;
}
jm1++;
if (minDistance > threshold) { return int.MaxValue; }
}
int result = dCurrent[maxi];
return (result > threshold) ? int.MaxValue : result;
}
在哪里Swap
:
static void Swap<T>(ref T arg1,ref T arg2) {
T temp = arg1;
arg1 = arg2;
arg2 = temp;
}
您要查找的内容称为编辑距离或Levenshtein 距离。维基百科文章解释了它是如何计算的,并在底部有一段很好的伪代码,可以帮助您非常轻松地用 C# 编写这个算法。
这是下面链接的第一个站点的实现:
private static int CalcLevenshteinDistance(string a, string b)
{
if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
return 0;
}
if (String.IsNullOrEmpty(a)) {
return b.Length;
}
if (String.IsNullOrEmpty(b)) {
return a.Length;
}
int lengthA = a.Length;
int lengthB = b.Length;
var distances = new int[lengthA + 1, lengthB + 1];
for (int i = 0; i <= lengthA; distances[i, 0] = i++);
for (int j = 0; j <= lengthB; distances[0, j] = j++);
for (int i = 1; i <= lengthA; i++)
for (int j = 1; j <= lengthB; j++)
{
int cost = b[j - 1] == a[i - 1] ? 0 : 1;
distances[i, j] = Math.Min
(
Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
distances[i - 1, j - 1] + cost
);
}
return distances[lengthA, lengthB];
}
我发现Levenshtein和Jaro Winkler非常适合字符串之间的细微差异,例如:
然而,当比较文章标题之类的内容时,其中大部分文本相同但边缘有“噪音”,Smith-Waterman-Gotoh非常棒:
比较这两个标题(相同但不同来源的措辞不同):
一种来自大肠杆菌的核酸内切酶,可在紫外线照射的 DNA 中引入单多核苷酸链断裂
核酸内切酶 III:一种来自大肠杆菌的核酸内切酶,可在紫外线照射的 DNA 中引入单多核苷酸链断裂
这个提供字符串算法比较的网站显示:
Jaro Winkler 和 Levenshtein 在检测相似性方面的能力不如 Smith Waterman Gotoh。如果我们比较两个不同的标题,但有一些匹配的文本:
高等植物的脂肪代谢。酰基硫酯酶在酰基辅酶A和酰基载体蛋白s代谢中的作用
高等植物的脂肪代谢。复杂脂质混合物中酰基-酰基载体蛋白和酰基辅酶A的测定
Jaro Winkler 给出了误报,但 Smith Waterman Gotoh 没有:
正如Anastasiosyal指出的那样,SimMetrics拥有这些算法的 java 代码。我使用 SimMetrics 中的SmithWatermanGotoh java 代码取得了成功。
这是我对 Damerau Levenshtein Distance 的实现,它不仅返回相似系数,还返回更正单词中的错误位置(此功能可在文本编辑器中使用)。我的实现也支持不同权重的错误(替换、删除、插入、转置)。
public static List<Mistake> OptimalStringAlignmentDistance(
string word, string correctedWord,
bool transposition = true,
int substitutionCost = 1,
int insertionCost = 1,
int deletionCost = 1,
int transpositionCost = 1)
{
int w_length = word.Length;
int cw_length = correctedWord.Length;
var d = new KeyValuePair<int, CharMistakeType>[w_length + 1, cw_length + 1];
var result = new List<Mistake>(Math.Max(w_length, cw_length));
if (w_length == 0)
{
for (int i = 0; i < cw_length; i++)
result.Add(new Mistake(i, CharMistakeType.Insertion));
return result;
}
for (int i = 0; i <= w_length; i++)
d[i, 0] = new KeyValuePair<int, CharMistakeType>(i, CharMistakeType.None);
for (int j = 0; j <= cw_length; j++)
d[0, j] = new KeyValuePair<int, CharMistakeType>(j, CharMistakeType.None);
for (int i = 1; i <= w_length; i++)
{
for (int j = 1; j <= cw_length; j++)
{
bool equal = correctedWord[j - 1] == word[i - 1];
int delCost = d[i - 1, j].Key + deletionCost;
int insCost = d[i, j - 1].Key + insertionCost;
int subCost = d[i - 1, j - 1].Key;
if (!equal)
subCost += substitutionCost;
int transCost = int.MaxValue;
if (transposition && i > 1 && j > 1 && word[i - 1] == correctedWord[j - 2] && word[i - 2] == correctedWord[j - 1])
{
transCost = d[i - 2, j - 2].Key;
if (!equal)
transCost += transpositionCost;
}
int min = delCost;
CharMistakeType mistakeType = CharMistakeType.Deletion;
if (insCost < min)
{
min = insCost;
mistakeType = CharMistakeType.Insertion;
}
if (subCost < min)
{
min = subCost;
mistakeType = equal ? CharMistakeType.None : CharMistakeType.Substitution;
}
if (transCost < min)
{
min = transCost;
mistakeType = CharMistakeType.Transposition;
}
d[i, j] = new KeyValuePair<int, CharMistakeType>(min, mistakeType);
}
}
int w_ind = w_length;
int cw_ind = cw_length;
while (w_ind >= 0 && cw_ind >= 0)
{
switch (d[w_ind, cw_ind].Value)
{
case CharMistakeType.None:
w_ind--;
cw_ind--;
break;
case CharMistakeType.Substitution:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Substitution));
w_ind--;
cw_ind--;
break;
case CharMistakeType.Deletion:
result.Add(new Mistake(cw_ind, CharMistakeType.Deletion));
w_ind--;
break;
case CharMistakeType.Insertion:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Insertion));
cw_ind--;
break;
case CharMistakeType.Transposition:
result.Add(new Mistake(cw_ind - 2, CharMistakeType.Transposition));
w_ind -= 2;
cw_ind -= 2;
break;
}
}
if (d[w_length, cw_length].Key > result.Count)
{
int delMistakesCount = d[w_length, cw_length].Key - result.Count;
for (int i = 0; i < delMistakesCount; i++)
result.Add(new Mistake(0, CharMistakeType.Deletion));
}
result.Reverse();
return result;
}
public struct Mistake
{
public int Position;
public CharMistakeType Type;
public Mistake(int position, CharMistakeType type)
{
Position = position;
Type = type;
}
public override string ToString()
{
return Position + ", " + Type;
}
}
public enum CharMistakeType
{
None,
Substitution,
Insertion,
Deletion,
Transposition
}
此代码是我的项目的一部分:Yandex-Linguistics.NET。
我写了一些测试,在我看来该方法有效。
但欢迎评论和评论。
这是另一种方法:
寻找相似性的典型方法是 Levenshtein 距离,毫无疑问,有一个带有代码的库。
不幸的是,这需要比较每个字符串。如果距离大于某个阈值,您也许可以编写专门版本的代码来缩短计算,您仍然需要进行所有比较。
另一个想法是使用 trigrams 或 n-grams 的一些变体。这些是 n 个字符的序列(或 n 个单词或 n 个基因组序列或 n 个)。保留三元组到字符串的映射,并选择重叠最大的那些。n 的典型选择是“3”,因此得名。
例如,英语会有这些三元组:
英格兰将拥有:
好吧,7 场比赛中有 2 场(或 10 场比赛中有 4 场)匹配。如果这对您有用,您可以索引 trigram/string 表并获得更快的搜索。
您还可以将其与 Levenshtein 结合使用,以减少与具有一些最小数量的共同 n-gram 的比较集。
这是一个 VB.net 实现:
Public Shared Function LevenshteinDistance(ByVal v1 As String, ByVal v2 As String) As Integer
Dim cost(v1.Length, v2.Length) As Integer
If v1.Length = 0 Then
Return v2.Length 'if string 1 is empty, the number of edits will be the insertion of all characters in string 2
ElseIf v2.Length = 0 Then
Return v1.Length 'if string 2 is empty, the number of edits will be the insertion of all characters in string 1
Else
'setup the base costs for inserting the correct characters
For v1Count As Integer = 0 To v1.Length
cost(v1Count, 0) = v1Count
Next v1Count
For v2Count As Integer = 0 To v2.Length
cost(0, v2Count) = v2Count
Next v2Count
'now work out the cheapest route to having the correct characters
For v1Count As Integer = 1 To v1.Length
For v2Count As Integer = 1 To v2.Length
'the first min term is the cost of editing the character in place (which will be the cost-to-date or the cost-to-date + 1 (depending on whether a change is required)
'the second min term is the cost of inserting the correct character into string 1 (cost-to-date + 1),
'the third min term is the cost of inserting the correct character into string 2 (cost-to-date + 1) and
cost(v1Count, v2Count) = Math.Min(
cost(v1Count - 1, v2Count - 1) + If(v1.Chars(v1Count - 1) = v2.Chars(v2Count - 1), 0, 1),
Math.Min(
cost(v1Count - 1, v2Count) + 1,
cost(v1Count, v2Count - 1) + 1
)
)
Next v2Count
Next v1Count
'the final result is the cheapest cost to get the two strings to match, which is the bottom right cell in the matrix
'in the event of strings being equal, this will be the result of zipping diagonally down the matrix (which will be square as the strings are the same length)
Return cost(v1.Length, v2.Length)
End If
End Function