c - What is the best hash function for storing large random numbers?

Question

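I want to store a large quantity of numbers in a data structure, and I want to use a hash function so that insertion, deletion, and search are fast. But I can't decide which hash function I should use.

More generally, I'd like to know how to determine whether a hash function is a good fit for a particular problem.

EDIT: I think people are getting confused by the word "random". By random I mean that there is no particular range the numbers must come from (any 32-bit integer is possible), but I do know roughly how many numbers will be stored in the data structure, say about 5000. So please suggest the best hash function for this scenario, and explain why you think it is the best.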

score 4 · Accepted Answer

If the numbers are uniformly random, just use a hash function that selects the low bits.

/* Keep only the low bits: converting to unsigned reduces x modulo UINT_MAX + 1. */
unsigned hash_number(long long x)
{
    return (unsigned) x;
}
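To show how such a low-bits hash might be used for the roughly 5000 numbers in the question, here is a minimal sketch of an open-addressing set with linear probing. The names `int_set`, `int_set_insert`, and `int_set_contains`, and the choice of 8192 slots, are assumptions made for illustration; deletion is left out because it needs tombstones or re-insertion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_BITS 13u                    /* 2^13 = 8192 slots for ~5000 keys */
#define TABLE_SIZE (1u << TABLE_BITS)
#define TABLE_MASK (TABLE_SIZE - 1u)

typedef struct {
    uint32_t keys[TABLE_SIZE];
    bool     used[TABLE_SIZE];
    size_t   count;
} int_set;

void int_set_init(int_set *s) { memset(s, 0, sizeof *s); }

/* The hash: keep the low TABLE_BITS bits of the key. */
uint32_t slot_for(uint32_t key) { return key & TABLE_MASK; }

bool int_set_contains(const int_set *s, uint32_t key)
{
    uint32_t i = slot_for(key);
    while (s->used[i]) {                  /* probe until an empty slot */
        if (s->keys[i] == key) return true;
        i = (i + 1u) & TABLE_MASK;
    }
    return false;
}

/* Returns true if the key was added, false if it was already present. */
bool int_set_insert(int_set *s, uint32_t key)
{
    /* Keep at least one empty slot so that probing always terminates. */
    assert(s->count < TABLE_SIZE - 1u);
    uint32_t i = slot_for(key);
    while (s->used[i]) {
        if (s->keys[i] == key) return false;
        i = (i + 1u) & TABLE_MASK;
    }
    s->used[i] = true;
    s->keys[i] = key;
    s->count++;
    return true;
}
```

With 5000 keys in 8192 slots the table is about 61% full, so for uniformly random keys most lookups should finish after a few probes.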

score 1 · Accepted Answer

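Even if your input numbers are completely random, using h(x) = x may still cause performance problems. Suppose your numbers are chosen at random from 0, 2, 4, ..., 2k. Although they are random, none of them will ever map to an odd-numbered bucket of the hash table (bucket 1, for example), assuming the number of buckets is a power of two. So what really matters is the information entropy of the input numbers.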
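A small sketch of this effect, assuming a table of 16 buckets and the hypothetical helper names `identity_bucket` and `buckets_hit_by_even_keys`: with h(x) = x and only even keys, only the even-numbered buckets are ever used, so half the table stays empty.

```c
#include <stdint.h>

/* h(x) = x, reduced to a power-of-two bucket count by masking. */
uint32_t identity_bucket(uint32_t x, uint32_t bucket_count)
{
    return x & (bucket_count - 1u);
}

/* Count how many of 16 buckets receive at least one of the
   even keys 0, 2, 4, ..., 2*(n-1). */
unsigned buckets_hit_by_even_keys(unsigned n)
{
    enum { BUCKETS = 16 };
    int hit[BUCKETS] = {0};
    unsigned used = 0;

    for (uint32_t i = 0; i < n; i++)
        hit[identity_bucket(2u * i, BUCKETS)] = 1;
    for (int b = 0; b < BUCKETS; b++)
        used += (unsigned)hit[b];
    return used;
}
```

For any n of 8 or more this returns 8, because an even key masked with 15 is always even.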

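In your case, a good choice is Thomas Wang's integer hash function, which is invertible and has a good avalanche effect (http://en.wikipedia.org/wiki/Avalanche_effect). There is an article describing Thomas Wang's hash function and its inverse: http://naml.us/blog/2012/03.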
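For reference, below is the 32-bit variant of Thomas Wang's integer hash as it is commonly circulated. The function name `wang_hash32` is an assumption of this sketch, and the constants should be checked against the linked article before relying on them.

```c
#include <stdint.h>

/* 32-bit integer mix, as commonly attributed to Thomas Wang.
   Every step is invertible modulo 2^32 (multiplication by an odd
   constant, or an xor with a right-shifted copy of the value), so
   distinct inputs always give distinct outputs. */
uint32_t wang_hash32(uint32_t key)
{
    key = ~key + (key << 15);   /* same as key * 32767 - 1 */
    key ^= key >> 12;
    key += key << 2;            /* same as key * 5 */
    key ^= key >> 4;
    key *= 2057u;               /* key + (key << 3) + (key << 11) */
    key ^= key >> 16;
    return key;
}
```

To pick a bucket in a power-of-two table, mask the result, for example `wang_hash32(x) & (bucket_count - 1)`. Unlike h(x) = x, even inputs can then land in odd-numbered buckets.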

score 0 · Accepted Answer

Your question doesn't make sense to me. Using a hashing algorithm to store some random numbers is overkill. If there is something more to the problem, the choice of data structure will depend on what this something more is (which you don't say).

If these numbers really are random or pseudorandom then all you need is a stack or circular buffer - the capability to add (push) a new random number into the data structure and the capability to remove (pop) an existing random number from the structure. If you want to retrieve them in order, use a circular buffer. A hashing function is worse in every respect than a simple stack (or circular buffer) for holding a list of random numbers - it is more complex, runs slower, and uses more memory.

Most languages/environments provide hash functions which can be used (or are provided as) "dictionary" classes, and these come with guidance as to efficiency. Generally, you can make dictionary classes faster by allocating more memory - they slow down when hash keys collide. So the "density" of actual numbers amongst all possible numbers matters.

So if you had to hold 100 such numbers, you could use a hash function which looked only at the last 11 bits. This gives 2^11 = 2048 possible hashes, so a new number will collide with a stored one at most 100/2048 of the time, less than 5%. On the other hand, you are using over 20 times as much memory as you should. (This function is the same as taking the number modulo 2^11, and is similar to what Epp suggested.)
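The arithmetic can be sketched as follows. The helpers `low_bits_hash` and `collision_upper_bound` are hypothetical names; the bound is simply the number of stored values divided by the number of possible hashes, which is the worst case for one new uniformly random key.

```c
#include <assert.h>
#include <stdint.h>

/* Keep the lowest `bits` bits of x, i.e. x modulo 2^bits. */
uint32_t low_bits_hash(uint32_t x, unsigned bits)
{
    assert(bits >= 1 && bits <= 31);
    return x & ((UINT32_C(1) << bits) - 1u);
}

/* Upper bound on the chance that one new uniformly random key
   lands on an occupied hash value when `stored` keys are present. */
double collision_upper_bound(unsigned stored, unsigned bits)
{
    assert(bits >= 1 && bits <= 31);
    return (double)stored / (double)(UINT32_C(1) << bits);
}
```

For example, `collision_upper_bound(100, 11)` is 100/2048, a little under 0.05, and `low_bits_hash(x, 11)` equals `x % 2048`.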

Writing a storage class based on a hash function which properly handles hash collisions (as it must), gracefully handles duplicated data, won't freak if you chuck it bad data (like every number the same), and is efficient, is not a trivial task.

On the other hand, implementing a stack or circular buffer is extremely simple, very efficient, and has entirely predictable behaviour.
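As a rough illustration of how little code this takes, here is a minimal fixed-capacity circular buffer. The names `ring_buffer`, `ring_push`, and `ring_pop` and the capacity of 8192 are assumptions for this sketch, not part of any library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_CAPACITY 8192u

typedef struct {
    uint32_t data[RING_CAPACITY];
    size_t   head;    /* index of the oldest element */
    size_t   count;   /* number of stored elements */
} ring_buffer;

/* Append a value; returns false if the buffer is full. */
bool ring_push(ring_buffer *r, uint32_t value)
{
    if (r->count == RING_CAPACITY) return false;
    r->data[(r->head + r->count) % RING_CAPACITY] = value;
    r->count++;
    return true;
}

/* Remove the oldest value into *out; returns false if empty. */
bool ring_pop(ring_buffer *r, uint32_t *out)
{
    if (r->count == 0) return false;
    *out = r->data[r->head];
    r->head = (r->head + 1u) % RING_CAPACITY;
    r->count--;
    return true;
}
```

A zero-initialized `ring_buffer` is empty and ready to use. Values come back in insertion order; this sketch offers push and pop only, so finding a particular value would need a linear scan.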

Are you sure you aren't making this more complicated than it needs to be?
