c# - C#字符串的GetHashCode()是如何实现的？

Question

我只是好奇，因为我猜它会对性能产生影响。它是否考虑完整的字符串？如果是，那么长字符串会很慢。如果它只考虑字符串的一部分，它的性能会很差（例如，如果它只考虑字符串的开头，如果一个 HashSet 包含大部分相同的字符串，它的性能就会很差。

score 98 · Accepted Answer

当您有此类问题时，请务必获取参考源代码。它比您从反编译器中看到的要多得多。选择与您首选的 .NET 目标匹配的那个，该方法在版本之间发生了很大变化。我将在这里重现它的 .NET 4.5 版本，从 Source.NET 4.5\4.6.0.0\net\clr\src\BCL\System\String.cs\604718\String.cs 检索

        public override int GetHashCode() { 

#if FEATURE_RANDOMIZED_STRING_HASHING
            if(HashHelpers.s_UseRandomizedStringHashing)
            { 
                return InternalMarvin32HashString(this, this.Length, 0);
            } 
#endif // FEATURE_RANDOMIZED_STRING_HASHING 

            unsafe { 
                fixed (char *src = this) {
                    Contract.Assert(src[this.Length] == '\0', "src[this.Length] == '\\0'");
                    Contract.Assert( ((int)src)%4 == 0, "Managed string should start at 4 bytes boundary");

#if WIN32
                    int hash1 = (5381<<16) + 5381; 
#else 
                    int hash1 = 5381;
#endif 
                    int hash2 = hash1;

#if WIN32
                    // 32 bit machines. 
                    int* pint = (int *)src;
                    int len = this.Length; 
                    while (len > 2) 
                    {
                        hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0]; 
                        hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
                        pint += 2;
                        len  -= 4;
                    } 

                    if (len > 0) 
                    { 
                        hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
                    } 
#else
                    int     c;
                    char *s = src;
                    while ((c = s[0]) != 0) { 
                        hash1 = ((hash1 << 5) + hash1) ^ c;
                        c = s[1]; 
                        if (c == 0) 
                            break;
                        hash2 = ((hash2 << 5) + hash2) ^ c; 
                        s += 2;
                    }
#endif
#if DEBUG 
                    // We want to ensure we can change our hash function daily.
                    // This is perfectly fine as long as you don't persist the 
                    // value from GetHashCode to disk or count on String A 
                    // hashing before string B.  Those are bugs in your code.
                    hash1 ^= ThisAssembly.DailyBuildNumber; 
#endif
                    return hash1 + (hash2 * 1566083941);
                }
            } 
        }

这可能超出了您的预期，我将对代码进行一些注释：

#if 条件编译指令使该代码适应不同的 .NET 目标。FEATURE_XX 标识符在其他地方定义，并在整个 .NET 源代码中关闭功能。当目标是32位版本的框架时定义WIN32，64位版本的mscorlib.dll是单独构建的，存放在GAC的不同子目录中。
s_UseRandomizedStringHashing 变量启用了散列算法的安全版本，旨在让程序员避免做一些不明智的事情，例如使用 GetHashCode() 为密码或加密等内容生成散列。它由app.exe.config 文件中的条目启用
固定语句保持索引字符串便宜，避免了常规索引器完成的边界检查
第一个 Assert 确保字符串按照应有的方式以零结尾，这是允许循环中优化所必需的
第二个 Assert 确保字符串与应为 4 的倍数的地址对齐，这是保持循环性能所必需的
循环是手动展开的，对于 32 位版本，每个循环消耗 4 个字符。转换为 int* 是将 2 个字符（2 x 16 位）存储在 int（32 位）中的技巧。循环之后的额外语句处理长度不是 4 的倍数的字符串。请注意，零终止符可能包含在散列中，也可能不包含，如果长度是偶数则不会。它查看字符串中的所有字符，回答您的问题
The 64-bit version of the loop is done differently, hand-unrolled by 2. Note that it terminates early on an embedded zero, so doesn't look at all the characters. Otherwise very uncommon. That's pretty odd, I can only guess that this has something to do with strings potentially being very large. But can't think of a practical example
The debug code at the end ensures that no code in the framework ever takes a dependency on the hash code being reproducible between runs.
The hash algorithm is pretty standard. The value 1566083941 is a magic number, a prime that is common in a Mersenne twister.

score 6 · Accepted Answer

检查源代码（由ILSpy提供），我们可以看到它确实迭代了字符串的长度。

// string
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail), SecuritySafeCritical]
public unsafe override int GetHashCode()
{
    IntPtr arg_0F_0;
    IntPtr expr_06 = arg_0F_0 = this;
    if (expr_06 != 0)
    {
        arg_0F_0 = (IntPtr)((int)expr_06 + RuntimeHelpers.OffsetToStringData);
    }
    char* ptr = arg_0F_0;
    int num = 352654597;
    int num2 = num;
    int* ptr2 = (int*)ptr;
    for (int i = this.Length; i > 0; i -= 4)
    {
        num = ((num << 5) + num + (num >> 27) ^ *ptr2);
        if (i <= 2)
        {
            break;
        }
        num2 = ((num2 << 5) + num2 + (num2 >> 27) ^ ptr2[(IntPtr)4 / 4]);
        ptr2 += (IntPtr)8 / 4;
    }
    return num + num2 * 1566083941;
}

c# - C#字符串的GetHashCode()是如何实现的？

2 回答 2

Related

Reference