53

Say I have an object that stores a byte array, and I want to generate a hash code for it efficiently. I've used cryptographic hash functions for this in the past because they are easy to implement, but they do far more work than necessary in order to be cryptographically one-way, and I don't care about that (I'm just using the hash code as a key into a hashtable).

Here's what I have today:

struct SomeData : IEquatable<SomeData>
{
    private readonly byte[] data;
    public SomeData(byte[] data)
    {
        if (null == data || data.Length <= 0)
        {
            throw new ArgumentException("data");
        }
        this.data = new byte[data.Length];
        Array.Copy(data, this.data, data.Length);
    }

    public override bool Equals(object obj)
    {
        return obj is SomeData && Equals((SomeData)obj);
    }

    public bool Equals(SomeData other)
    {
        if (other.data.Length != data.Length)
        {
            return false;
        }
        for (int i = 0; i < data.Length; ++i)
        {
            if (data[i] != other.data[i])
            {
                return false;
            }
        }
        return true;
    }
    public override int GetHashCode()
    {
        return BitConverter.ToInt32(new MD5CryptoServiceProvider().ComputeHash(data), 0);
    }
}

Any thoughts?


dp: You are right that I missed a check in Equals; I have updated it. Using the existing hash code from the byte array will result in reference equality (or at least that same concept translated to hash codes). For example:

byte[] b1 = new byte[] { 1 };
byte[] b2 = new byte[] { 1 };
int h1 = b1.GetHashCode();
int h2 = b2.GetHashCode();

With that code, despite the two byte arrays having the same values within them, they are referring to different parts of memory and will result in (probably) different hash codes. I need the hash codes for two byte arrays with the same contents to be equal.

4

11 Answers

67

The hash code of an object does not need to be unique.

The checking rule is:

  • Are the hash codes equal? Then call the full (slow) Equals method.
  • Are the hash codes not equal? Then the two items are definitely not equal.

All you want is a GetHashCode algorithm that splits your collection into roughly even groups. It doesn't need to act as the key itself; the HashTable or Dictionary<> only uses the hash to optimise retrieval.

How long do you expect the data to be? How random? If lengths vary greatly (say for files) then just return the length. If lengths are likely to be similar look at a subset of the bytes that varies.

GetHashCode should be a lot quicker than Equals, but doesn't need to be unique.

Two identical things must never have different hash codes. Two different objects should not have the same hash code, but some collisions are to be expected (after all, there are more permutations than possible 32 bit integers).
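
The checking rule above can be sketched in a few lines. This is only an illustration of the rule, not how Dictionary<> is implemented internally; the names HashRuleSketch, LengthHash and AreEqual are made up for the example, and the hash is simply the array length, as suggested for data whose lengths vary a lot.

```csharp
using System;
using System.Linq;

public static class HashRuleSketch
{
    // Cheapest valid hash: arrays with equal contents always have equal
    // lengths, so equal arrays can never get different hash codes.
    public static int LengthHash(byte[] data)
    {
        return data.Length;
    }

    // Mirrors the two-step rule: compare hashes first, and run the slow
    // element-by-element comparison only when the hashes match.
    public static bool AreEqual(byte[] a, byte[] b)
    {
        if (LengthHash(a) != LengthHash(b))
        {
            return false; // different hashes: definitely not equal
        }
        return a.SequenceEqual(b); // same hash: contents must still be checked
    }
}
```

If every array in the data set has the same length, this hash puts everything into one group, which is when sampling some of the bytes becomes worthwhile.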

answered 2008-08-19T15:17:05.937
51

Don't use a cryptographic hash for a hashtable; that's ridiculous overkill.

Here you go... a modified FNV hash in C#:

http://bretm.home.comcast.net/hash/6.html

    public static int ComputeHash(params byte[] data)
    {
        unchecked
        {
            const int p = 16777619;
            int hash = (int)2166136261;

            for (int i = 0; i < data.Length; i++)
                hash = (hash ^ data[i]) * p;

            hash += hash << 13;
            hash ^= hash >> 7;
            hash += hash << 3;
            hash ^= hash >> 17;
            hash += hash << 5;
            return hash;
        }
    }
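
One way to use this function without wrapping every array in a custom type is to plug it into an IEqualityComparer<byte[]> and pass that to the dictionary. The sketch below assumes that approach; FnvByteArrayComparer is a hypothetical name, and the hash body is copied from the function above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: content-based equality and hashing for byte[] keys.
public sealed class FnvByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] x, byte[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.SequenceEqual(y);
    }

    public int GetHashCode(byte[] data)
    {
        unchecked
        {
            const int p = 16777619;
            int hash = (int)2166136261;

            // FNV-style loop: XOR in each byte, then multiply by the prime.
            for (int i = 0; i < data.Length; i++)
                hash = (hash ^ data[i]) * p;

            // Final mixing steps, as in the function above.
            hash += hash << 13;
            hash ^= hash >> 7;
            hash += hash << 3;
            hash ^= hash >> 17;
            hash += hash << 5;
            return hash;
        }
    }
}
```

A dictionary created with new Dictionary<byte[], string>(new FnvByteArrayComparer()) then treats two arrays with the same contents as the same key.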
answered 2009-01-22T04:55:49.990
13

Borrowing from the code generated by JetBrains software, I have settled on this function:

    public override int GetHashCode()
    {
        unchecked
        {
            var result = 0;
            foreach (byte b in _key)
                result = (result*31) ^ b;
            return result;
        }
    }

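The problem with just XORing the bytes is that 3/4 (3 bytes) of the returned value has only 2 possible values (all on or all off). This spreads the bits around a little more.

Setting a breakpoint in Equals was a good suggestion. Adding about 200,000 entries of my data to a Dictionary results in about 10 Equals calls (or 1 in 20,000).

answered 2009-01-08T17:37:53.877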
4

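Have you compared it with the SHA1CryptoServiceProvider.ComputeHash method? It takes a byte array and returns a SHA1 hash, and I believe it is pretty well optimized. I used it in an Identicon handler that performed very well under load.

answered 2008-08-19T15:53:28.983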
4

I found some interesting results:

I have this class:

public class MyHash : IEquatable<MyHash>
{        
    public byte[] Val { get; private set; }

    public MyHash(byte[] val)
    {
        Val = val;
    }

    /// <summary>
    /// Test if this Class is equal to another class
    /// </summary>
    /// <param name="other"></param>
    /// <returns></returns>
    public bool Equals(MyHash other)
    {
        if (other.Val.Length == this.Val.Length)
        {
            for (var i = 0; i < this.Val.Length; i++)
            {
                if (other.Val[i] != this.Val[i])
                {
                    return false;
                }
            }

            return true;
        }
        else
        {
            return false;
        }            
    }

    public override int GetHashCode()
    {            
        var str = Convert.ToBase64String(Val);
        return str.GetHashCode();          
    }
}

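Then I created a dictionary with keys of type MyHash to test how fast I can insert, and also to find out how many collisions there are. I did the following: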

        // dictionary we use to check for collisions
        Dictionary<MyHash, bool> checkForDuplicatesDic = new Dictionary<MyHash, bool>();

        // used to generate random arrays
        Random rand = new Random();



        var now = DateTime.Now;

        for (var j = 0; j < 100; j++)
        {
            for (var i = 0; i < 5000; i++)
            {
                // create new array and populate it with random bytes
                byte[] randBytes = new byte[byte.MaxValue];
                rand.NextBytes(randBytes);

                MyHash h = new MyHash(randBytes);

                if (checkForDuplicatesDic.ContainsKey(h))
                {
                    Console.WriteLine("Duplicate");
                }
                else
                {
                    checkForDuplicatesDic[h] = true;
                }
            }
            Console.WriteLine(j);
            checkForDuplicatesDic.Clear(); // clear dictionary every 5000 iterations
        }

        var elapsed = DateTime.Now - now;

        Console.Read();

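Every time I insert a new item into the dictionary, the dictionary computes the hash of that object. So you can tell which method is most efficient by dropping the various answers found here into the public override int GetHashCode() method. The method that was by far the fastest and had the fewest collisions was: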

    public override int GetHashCode()
    {            
        var str = Convert.ToBase64String(Val);
        return str.GetHashCode();          
    }

It took 2 seconds to execute. The method

    public override int GetHashCode()
    {
        // 7.1 seconds
        unchecked
        {
            const int p = 16777619;
            int hash = (int)2166136261;

            for (int i = 0; i < Val.Length; i++)
                hash = (hash ^ Val[i]) * p;

            hash += hash << 13;
            hash ^= hash >> 7;
            hash += hash << 3;
            hash ^= hash >> 17;
            hash += hash << 5;
            return hash;
        }
    }

also had no collisions, but it took 7 seconds to execute!

answered 2014-03-12T20:40:25.987
2

If you are looking for performance, I tested a few hash functions, and I recommend Bob Jenkins's hash function. It is both crazy fast to compute and will give as few collisions as the cryptographic hash you have used until now.

I don't know C# at all, and I don't know if it can link with C, but here is its implementation in C.
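
For C# readers, here is a sketch of one of Bob Jenkins's simpler functions, the one-at-a-time hash. It is an assumption that this variant is a fair stand-in; the C implementation the answer refers to may well be a different Jenkins function. JenkinsSketch and OneAtATime are names invented for the example.

```csharp
using System;

public static class JenkinsSketch
{
    // Jenkins one-at-a-time hash over a byte array, computed in 32-bit
    // unsigned arithmetic and returned as an int for use in GetHashCode.
    public static int OneAtATime(byte[] data)
    {
        unchecked
        {
            uint hash = 0;
            foreach (byte b in data)
            {
                hash += b;
                hash += hash << 10;
                hash ^= hash >> 6;
            }

            // Final avalanche so the last bytes affect every output bit.
            hash += hash << 3;
            hash ^= hash >> 11;
            hash += hash << 15;
            return (int)hash;
        }
    }
}
```

For the single byte 0x61 (ASCII "a") this produces 0xCA2E9442 when the result is viewed as an unsigned value.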

answered 2008-08-19T15:16:24.720
1

Is using the existing hashcode from the byte array field not good enough? Also note that in the Equals method you should check that the arrays are the same size before doing the compare.

answered 2008-08-19T15:19:32.203
1

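Generating a good hash is easier said than done. Remember, you're basically representing n bytes of data with m bits of information. The larger your data set and the smaller m is, the more likely you'll get a collision... two pieces of data resolving to the same hash.

The simplest hash I ever learned was simply XORing all the bytes together. It's easy, faster than most complicated hash algorithms, and a halfway decent general-purpose hash algorithm for small data sets. It really is the bubble sort of hash algorithms. Since the simple implementation leaves you with 8 bits, that's only 256 hashes... not so hot. You could XOR chunks instead of individual bytes, but then the algorithm gets more complicated.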
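
Both variants mentioned here, byte-wise XOR and XOR over chunks, can be sketched as follows. This is a minimal illustration under the assumption that a chunk is four bytes; XorHashSketch, XorBytes and XorBlocks are made-up names.

```csharp
using System;

public static class XorHashSketch
{
    // XOR of the individual bytes: the result is always in the range 0..255.
    public static int XorBytes(byte[] data)
    {
        int hash = 0;
        foreach (byte b in data)
            hash ^= b;
        return hash;
    }

    // XOR of 4-byte chunks: byte i is shifted to bit offset 0, 8, 16 or 24
    // depending on its position within its chunk, so all 32 bits get used.
    public static int XorBlocks(byte[] data)
    {
        int hash = 0;
        for (int i = 0; i < data.Length; i++)
            hash ^= data[i] << ((i % 4) * 8);
        return hash;
    }
}
```

Note that XorBytes ignores byte order entirely, and XorBlocks still cancels out bytes that repeat at the same offset in different chunks, so both are weak on structured data.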

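So certainly, the cryptographic algorithms may be doing some things you don't need... but they are also a huge step up in general-purpose hash quality. The MD5 hash you're using has 128 bits, with billions and billions of possible hashes. The only way you're likely to get something better is to take some representative samples of the data you expect to go through your application and try various algorithms on them to see how many collisions you get.

So until I see some reason not to use a canned hash algorithm (performance, perhaps?), I'm going to have to recommend you stick with what you've got.

answered 2008-08-19T15:31:02.757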
1

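Whether you want a perfect hash function (a different value for every distinct object) or just a pretty good one is always a performance trade-off: a good hash function normally takes time to compute, and if your data set is smallish you're better off with a fast function. The most important thing (as your second post points out) is correctness, and to achieve that all you need to do is return the Length of the array. Depending on your data set, that might even be OK. If it isn't (say all your arrays are equally long), you can go with something cheap, like looking at the first and last values and XORing them, and then add more complexity as you see fit for your data.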
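
As a rough sketch of that cheap option (assuming arrays of equal length, so the length itself carries no information), the hash below only looks at the first and last bytes. CheapHashSketch and FirstLastHash are hypothetical names.

```csharp
using System;

public static class CheapHashSketch
{
    // Looks only at the first and last byte, so it runs in constant time
    // regardless of array length. Every single-element array hashes to 0
    // because the first and last byte are the same element.
    public static int FirstLastHash(byte[] data)
    {
        if (data.Length == 0)
        {
            return 0;
        }
        return data[0] ^ data[data.Length - 1];
    }
}
```

It yields at most 256 distinct values, so it only suits small data sets whose first and last bytes actually vary.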

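A quick way to see how your hash function performs on your data is to add all the data to a hashtable and count the number of times the Equals function gets called; if it is called too often, you have more work to do on the function. If you do this, keep in mind that the hashtable's size needs to be set larger than your data set at the start, otherwise you are going to rehash the data, which will trigger reinserts and more Equals evaluations (though that is possibly more realistic?).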
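
That measurement could look roughly like the sketch below, which assumes byte[] keys and a pluggable hash function. CountingComparer and CollisionProbe are names invented for the example; the dictionary is pre-sized to the number of keys, as advised above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer that counts how often the dictionary calls Equals.
public sealed class CountingComparer : IEqualityComparer<byte[]>
{
    private readonly Func<byte[], int> hash;

    public int EqualsCalls { get; private set; }

    public CountingComparer(Func<byte[], int> hash)
    {
        this.hash = hash;
    }

    public bool Equals(byte[] x, byte[] y)
    {
        EqualsCalls++;
        return x.SequenceEqual(y);
    }

    public int GetHashCode(byte[] data)
    {
        return hash(data);
    }
}

public static class CollisionProbe
{
    // Inserts every key and reports how many Equals calls were needed.
    // Fewer calls means the hash function separates the keys better.
    public static int CountEqualsCalls(IList<byte[]> keys, Func<byte[], int> hash)
    {
        var comparer = new CountingComparer(hash);
        var table = new Dictionary<byte[], bool>(keys.Count, comparer);
        foreach (byte[] key in keys)
        {
            table[key] = true;
        }
        return comparer.EqualsCalls;
    }
}
```

Running CountEqualsCalls over the same sample with each candidate hash function gives a direct comparison without needing a breakpoint.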

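For some objects (not this one), a quick hash code can be generated with ToString().GetHashCode(). It is certainly not optimal, but it is useful, because people tend to return something close to the identity of the object from ToString(), and that is exactly what GetHashCode is looking for.

Trivia: the worst performance I have ever seen was when someone mistakenly returned a constant from GetHashCode. It is easy to spot with a debugger though, especially if you do lots of lookups in your hashtable.

answered 2008-09-09T22:35:54.877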
0

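RuntimeHelpers.GetHashCode might help:

From MSDN:

Serves as a hash function for a particular type, suitable for use in hashing algorithms and data structures such as a hash table.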
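
A small sketch of what that call does with arrays: RuntimeHelpers.GetHashCode hashes by object identity rather than by contents, so it is stable for one array instance but says nothing about whether two arrays hold the same bytes. IdentityHashSketch and SameIdentityHash are hypothetical names.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class IdentityHashSketch
{
    // Compares the identity-based hash codes of two array references.
    // Passing the same instance twice always returns true; two separate
    // arrays with equal contents will usually return false.
    public static bool SameIdentityHash(byte[] a, byte[] b)
    {
        return RuntimeHelpers.GetHashCode(a) == RuntimeHelpers.GetHashCode(b);
    }
}
```

That is the same reference-equality behaviour the question describes for b1.GetHashCode() and b2.GetHashCode(), so it does not by itself give equal hash codes for equal contents.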

answered 2008-08-20T02:32:20.193
0
private int? hashCode;

public override int GetHashCode()
{
    if (!hashCode.HasValue)
    {
        var hash = 0;
        for (var i = 0; i < bytes.Length; i++)
        {
            hash = (hash << 4) + bytes[i];
        }
        hashCode = hash;
    }
    return hashCode.Value;
}
answered 2014-08-28T21:22:58.703