c# - String.comparison performance (with trim)

Question

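I need to do a lot of high-performance, case-insensitive string comparisons, and I realized that my way of doing it, .ToLower().Trim(), was really stupid because of all the new strings being allocated.

So I dug around a little, and this way seems preferable: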

String.Compare(txt1,txt2, StringComparison.OrdinalIgnoreCase)

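The only problem here is that I want to ignore leading and trailing spaces, i.e. Trim(), but if I use Trim I have the same problem with string allocations. I guess I could check each string to see whether it StartsWith(" ") or EndsWith(" ") and only then Trim. Either that, or figure out the index and length for each string and pass them to this String.Compare overload: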

public static int Compare
(
    string strA,
    int indexA,
    string strB,
    int indexB,
    int length,
    StringComparison comparisonType
)

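But that seems rather messy, and I would probably have to juggle some integers unless I write a really big if-else statement for every combination of leading and trailing blanks on both strings... so any ideas for an elegant solution?

Here's my current proposal: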

public bool IsEqual(string a, string b)
    {
        return (string.Compare(a, b, StringComparison.OrdinalIgnoreCase) == 0);
    }

    public bool IsTrimEqual(string a, string b)
    {
        if (Math.Abs(a.Length- b.Length) > 2 ) // if length differs by more than 2, cant be equal
        {
            return  false;
        }
        else if (IsEqual(a,b))
        {
            return true;
        }
        else 
        {
            return (string.Compare(a.Trim(), b.Trim(), StringComparison.OrdinalIgnoreCase) == 0);
        }
    }

score 6 · Accepted Answer

Something like this should do it:

public static int TrimCompareIgnoreCase(string a, string b) {
   int indexA = 0;
   int indexB = 0;
   while (indexA < a.Length && Char.IsWhiteSpace(a[indexA])) indexA++;
   while (indexB < b.Length && Char.IsWhiteSpace(b[indexB])) indexB++;
   int lenA = a.Length - indexA;
   int lenB = b.Length - indexB;
   while (lenA > 0 && Char.IsWhiteSpace(a[indexA + lenA - 1])) lenA--;
   while (lenB > 0 && Char.IsWhiteSpace(b[indexB + lenB - 1])) lenB--;
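   if (lenA == 0 && lenB == 0) return 0;
   if (lenA == 0) return -1; // a is empty after trimming, so it sorts before b
   if (lenB == 0) return 1;  // b is empty after trimming, so it sorts before a
   // Ordinal, case-insensitive comparison of the common prefix, as the question asks for.
   int result = String.Compare(a, indexA, b, indexB, Math.Min(lenA, lenB), StringComparison.OrdinalIgnoreCase);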
   if (result == 0) {
      if (lenA < lenB) result--;
      if (lenA > lenB) result++;
   }
   return result;
}

Example:

string a = "  asdf ";
string b = " ASDF \t   ";

Console.WriteLine(TrimCompareIgnoreCase(a, b));

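Output:

0

You should profile it against a simple trim-and-compare on some real data to see whether it actually makes any difference for what you are going to use it for.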
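One way to run that comparison is a small timing harness like the sketch below. The class name CompareBenchmark, its Measure helper, and the TrimThenCompare baseline are hypothetical names introduced here for illustration; timings from a loop like this are only a rough indication and should be repeated on representative data.

```csharp
using System;
using System.Diagnostics;

static class CompareBenchmark
{
    // Times `iterations` passes in which `compare` is called on every pair in `samples`.
    public static TimeSpan Measure(Func<string, string, int> compare, string[] samples, int iterations)
    {
        int sink = 0; // accumulate the results so the calls cannot be discarded
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            foreach (string a in samples)
                foreach (string b in samples)
                    sink += compare(a, b);
        watch.Stop();
        GC.KeepAlive(sink);
        return watch.Elapsed;
    }

    // Baseline: allocate trimmed copies, then compare them ignoring case.
    public static int TrimThenCompare(string a, string b) =>
        string.Compare(a.Trim(), b.Trim(), StringComparison.OrdinalIgnoreCase);
}
```

Calling CompareBenchmark.Measure once with CompareBenchmark.TrimThenCompare and once with the allocation-free method, on the same sample array, gives two elapsed times to compare.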

score 3 · Accepted Answer

I would use the code you have

String.Compare(txt1,txt2, StringComparison.OrdinalIgnoreCase)

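and add .Trim() calls only where you need them. Compared with your initial option (.ToLower().Trim()), this saves four string allocations most of the time, and always saves the two from .ToLower().

If you still have performance problems after that, then your "messy" option is probably the best choice.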

score 2 · Accepted Answer

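First make sure that you really need to optimize this code. Maybe creating copies of the strings won't noticeably affect your program.

If you really do need to optimize, you can try to process the strings when you first store them rather than when you compare them (assuming that happens at different stages of the program). For example, store trimmed and lower-cased versions of the strings, so that a simple equality check is enough when you compare them.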
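A minimal sketch of that idea, assuming the strings live in objects you control. The Customer type and its NameKey property are hypothetical, introduced only for illustration:

```csharp
using System;

// Hypothetical type: the normalization cost is paid once, in the constructor.
sealed class Customer
{
    public string Name { get; }     // the text as it was entered
    public string NameKey { get; }  // trimmed, lower-cased copy used for comparisons

    public Customer(string name)
    {
        Name = name;
        NameKey = name.Trim().ToLowerInvariant();
    }

    // Comparing the precomputed keys is a plain ordinal check with no allocation.
    public bool SameNameAs(Customer other) =>
        string.Equals(NameKey, other.NameKey, StringComparison.Ordinal);
}
```

The trade-off is one extra string held per object in exchange for allocation-free comparisons later.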

score 2 · Accepted Answer

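Can't you just trim (and possibly lower-case) each string exactly once, when you obtain it? Doing more than that sounds like premature optimization...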

score 0 · Accepted Answer

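The thing is, if it needs to be done, it needs to be done. I don't think any of your different solutions will make a difference. In each case a number of comparisons are needed to find the white space or remove it.

Obviously, removing the white space is part of the problem, so you shouldn't worry about that.
Lower-casing the strings before comparing them is a bug if you are using Unicode characters, and it is possibly slower than copying the string.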

score 0 · Accepted Answer

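The warnings about premature optimization are correct, but I'll assume you have tested this and found that a lot of time is being wasted copying strings. In that case I would try the following: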

int startIndex1, length1, startIndex2, length2;
FindStartAndLength(txt1, out startIndex1, out length1);
FindStartAndLength(txt2, out startIndex2, out length2);

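// Compare only the overlapping part of the two trimmed ranges; Math.Max would
// pull trailing white space of the shorter string into the comparison.
int compareLength = Math.Min(length1, length2);
int result = string.Compare(txt1, startIndex1, txt2, startIndex2, compareLength,
                            StringComparison.OrdinalIgnoreCase);
// If the common prefix matches, the shorter trimmed string sorts first.
if (result == 0)
    result = length1.CompareTo(length2);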

FindStartAndLength is a function that finds the starting index and length of the "trimmed" string (this is untested, but it should give the general idea):

static void FindStartAndLength(string text, out int startIndex, out int length)
{
    startIndex = 0;
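    // Check the bound before indexing, otherwise an all-white-space string throws.
    while(startIndex < text.Length && char.IsWhiteSpace(text[startIndex]))
        startIndex++;

    length = text.Length - startIndex;
    // Same here: test length > 0 first so the index never drops below startIndex.
    while(length > 0 && char.IsWhiteSpace(text[startIndex + length - 1]))
        length--;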
}

score 0 · Accepted Answer

You could implement your own StringComparer. Here's a basic implementation:

public class TrimmingStringComparer : StringComparer
{
    private StringComparison _comparisonType;

    public TrimmingStringComparer()
        : this(StringComparison.CurrentCulture)
    {
    }

    public TrimmingStringComparer(StringComparison comparisonType)
    {
        _comparisonType = comparisonType;
    }

    public override int Compare(string x, string y)
    {
        int indexX;
        int indexY;
        int lengthX = TrimString(x, out indexX);
        int lengthY = TrimString(y, out indexY);

        if (lengthX <= 0 && lengthY <= 0)
            return 0; // both strings contain only white space

        if (lengthX <= 0)
            return -1; // x contains only white space, y doesn't

        if (lengthY <= 0)
            return 1; // y contains only white space, x doesn't

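        // Compare the common prefix first; ordering by length alone would sort "b" before "aa".
        int result = String.Compare(x, indexX, y, indexY, Math.Min(lengthX, lengthY), _comparisonType);

        // If the prefixes match, the shorter trimmed string sorts first.
        return result != 0 ? result : lengthX.CompareTo(lengthY);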
    }

    public override bool Equals(string x, string y)
    {
        return Compare(x, y) == 0;
    }

    public override int GetHashCode(string obj)
    {
        throw new NotImplementedException();
    }

    private int TrimString(string s, out int index)
    {
        index = 0;
        while (index < s.Length && Char.IsWhiteSpace(s, index)) index++;
        int last = s.Length - 1;
        while (last >= 0 && Char.IsWhiteSpace(s, last)) last--;
        return last - index + 1;
    }
}

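Remarks:

It hasn't been extensively tested and might contain bugs
Performance is yet to be evaluated (but it is probably better than calling Trim and ToLower)
The GetHashCode method is not implemented, so don't use it as a key comparer in a dictionary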

score 0 · Accepted Answer

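I notice that your first suggestion only compares for equality rather than for ordering, which allows some further efficiency savings.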

public static bool TrimmedOrdinalIgnoreCaseEquals(string x, string y)
{
    //Always check for identity (same reference) first for
    //any comparison (equality or otherwise) that could take some time.
    //Identity always entails equality, and equality always entails
    //equivalence.
    if(ReferenceEquals(x, y))
        return true;
    //We already know they aren't both null as ReferenceEquals(null, null)
    //returns true.
    if(x == null || y == null)
        return false;
    int startX = 0;
    //note we keep this one further than the last char we care about.
    int endX = x.Length;
    int startY = 0;
    //likewise, one further than we care about.
    int endY = y.Length;
    while(startX != endX && char.IsWhiteSpace(x[startX]))
        ++startX;
    while(startY != endY && char.IsWhiteSpace(y[startY]))
        ++startY;
    if(startX == endX)      //Empty when trimmed.
        return startY == endY;
    if(startY == endY)
        return false;
    //lack of bounds checking is safe as we would have returned
    //already in cases where endX and endY can fall below zero.
    while(char.IsWhiteSpace(x[endX - 1]))
        --endX;
    while(char.IsWhiteSpace(y[endY - 1]))
        --endY;
    //From this point on I am assuming you do not care about
    //the complications of case-folding, based on your example
    //referencing the ordinal version of string comparison
    if(endX - startX != endY - startY)
        return false;
    while(startX != endX)
    {
        //trade-off: with some data a case-sensitive
        //comparison first
        //could be more efficient.
        if(
            char.ToLowerInvariant(x[startX++])
            != char.ToLowerInvariant(y[startY++])
        )
            return false;
    }
    return true;
}

And of course, what is an equality checker without a matching hash code generator:

public static int TrimmedOrdinalIgnoreCaseHashCode(string str)
{
    //Higher CMP_NUM (or get rid of it altogether) gives
    //better hash, at cost of taking longer to compute.
    const int CMP_NUM = 12;
    if(str == null)
        return 0;
    int start = 0;
    int end = str.Length;
    while(start != end && char.IsWhiteSpace(str[start]))
        ++start;
    if(start != end)
        while(char.IsWhiteSpace(str[end - 1]))
            --end;

    int skipOn = (end - start) / CMP_NUM + 1;
    int ret = 757602046; // no harm matching native .NET with empty string.
    while(start < end)
    {
            //prime numbers are our friends.
        ret = unchecked(ret * 251 + (int)(char.ToLowerInvariant(str[start])));
        start += skipOn;
    }
    return ret;
}
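To use an equality check and a hash function like these as dictionary keys, they have to be exposed through an IEqualityComparer<string>. The sketch below is not the hand-written loops above: it swaps in the span-based APIs (AsSpan, MemoryExtensions.Trim, and string.GetHashCode taking a ReadOnlySpan<char>) and so assumes .NET Core 3.0 or later. The class name TrimmedOrdinalIgnoreCaseComparer is hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer; assumes .NET Core 3.0 or later for the span-based APIs.
sealed class TrimmedOrdinalIgnoreCaseComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        if (ReferenceEquals(x, y)) return true;       // same instance, or both null
        if (x == null || y == null) return false;
        // AsSpan().Trim() narrows a view over the original characters without copying them.
        return x.AsSpan().Trim().Equals(y.AsSpan().Trim(), StringComparison.OrdinalIgnoreCase);
    }

    public int GetHashCode(string obj)
    {
        if (obj == null) return 0;
        // Hash the same trimmed, case-insensitive view that Equals compares.
        return string.GetHashCode(obj.AsSpan().Trim(), StringComparison.OrdinalIgnoreCase);
    }
}
```

Passing an instance to a collection constructor, for example new Dictionary<string, int>(new TrimmedOrdinalIgnoreCaseComparer()), makes lookups ignore case and surrounding white space.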
