c++ - 具有低冲突率的 32 位整数的快速字符串散列算法

Question

我有很多不相关的命名事物，我想对其进行快速搜索。“土豚”总是到处都是“土豚”，因此散列字符串并重用整数可以很好地加快比较速度。整个名称集是未知的（并且随着时间的推移而变化）。什么是快速字符串散列算法，它将生成小（32 或 16）位值并具有低冲突率？

我希望看到特定于 C/C++ 的优化实现。

score 35 · Accepted Answer

35

Murmur Hash很不错。

于 2008-09-22T10:17:20.290 回答

score 30 · Accepted Answer

FNV 变体之一应满足您的要求。它们速度很快，并且产生相当均匀分布的输出。

score 17 · Accepted Answer

对于固定的字符串集，请使用 gperf。

如果您的字符串集发生更改，您必须选择一个哈希函数。这个话题之前已经讨论过：

使用 hash_map 时在 stl 字符串上使用的最佳散列算法是什么？

score 17 · Accepted Answer

在everlyconfuzzled.com上也有一篇不错的文章。

Jenkins 的字符串一次一次哈希应该是这样的：

#include <stdint.h>

uint32_t hash_string(const char * s)
{
    uint32_t hash = 0;

    for(; *s; ++s)
    {
        hash += *s;
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }

    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);

    return hash;
}

score 8 · Accepted Answer

根据您的用例，另一个可能更好的解决方案是interned strings。这就是符号的工作方式，例如在 Lisp 中。

实习字符串是一个字符串对象，其值是实际字符串字节的地址。因此，您通过签入全局表来创建一个内部字符串对象：如果该字符串在其中，则将内部字符串初始化为该字符串的地址。如果没有，则插入它，然后初始化您的实习字符串。

这意味着从同一字符串构建的两个内部字符串将具有相同的值，即一个地址。因此，如果 N 是系统中的留存字符串数，则特征为：

建设缓慢（需要查找和可能的内存分配）
在并发线程的情况下需要全局数据和同步
比较是 O(1)，因为您正在比较地址，而不是实际的字符串字节（这意味着排序效果很好，但它不会是字母排序）。

干杯，

卡尔

score 5 · Accepted Answer

一个好的主题永远不会迟到，我相信人们会对我的发现感兴趣。

我需要一个哈希函数，在阅读了这篇文章并对此处给出的链接进行了一些研究之后，我想出了 Daniel J Bernstein 算法的这种变体，我曾经做过一个有趣的测试：



unsigned long djb_hashl(const char *clave)
{
    unsigned long c,i,h;

    for(i=h=0;clave[i];i++)
    {
        c = toupper(clave[i]);
        h = ((h << 5) + h) ^ c;
    }
    return h;
}

这种变体忽略大小写对字符串进行哈希处理，这适合我对用户登录凭据进行哈希处理的需要。“clave”在西班牙语中是“key”。我为西班牙语感到抱歉，但它是我的母语，程序是写在上面的。

好吧，我编写了一个程序，它将生成从“test_aaaa”到“test_zzzz”的用户名，并且-为了使字符串更长-我在这个列表中添加了一个随机域：“cloud-nueve.com”、“yahoo.com” '、'gmail.com' 和 'hotmail.com'。因此，他们每个人看起来像：

test_aaaa@cloud-nueve.com, test_aaab@yahoo.com,
test_aaac@gmail.com、test_aaad@hotmail.com 等等。

这是测试的输出 -'Colision entre XXX y XXX' 表示'碰撞 XXX 和 XXX'。“palabras”的意思是“单词”，而“Total”在两种语言中都是一样的——。

    布斯坎多·科利西内斯...
    碰撞进入 'test_phiz@hotmail.com' y 'test_juxg@cloud-nueve.com' (1DB903B7)
    冲突进入 'test_rfhh@hotmail.com' y 'test_fpgo@yahoo.com' (2F5BC088)
    碰撞进入 'test_wxuj@hotmail.com' y 'test_pugy@cloud-nueve.com' (51FD09CC)
    碰撞进入 'test_sctb@gmail.com' y 'test_iohw@cloud-nueve.com' (52F5480E)
    碰撞进入 'test_wpgu@cloud-nueve.com' y 'test_seik@yahoo.com' (74FF72E2)
    冲突进入 'test_rfll@hotmail.com' y 'test_btgo@yahoo.com' (7FD70008)
    碰撞进入 'test_wcho@cloud-nueve.com' y 'test_scfz@gmail.com' (9BD351C4)
    碰撞进入 'test_swky@cloud-nueve.com' y 'test_fqpn@gmail.com' (A86953E1)
    冲突进入 'test_rftd@hotmail.com' y 'test_jlgo@yahoo.com' (BA6B0718)
    冲突进入 'test_rfpp@hotmail.com' y 'test_nxgo@yahoo.com' (D0523F88)
    碰撞进入 'test_zlgo@yahoo.com' y 'test_rfdd@hotmail.com' (DEE08108)
    总de Colisones：11
    总 de Palabras : 456976

这还不错，456,976 次中有 11 次冲突（当然使用完整的 32 位作为表长度）。

使用 5 个字符运行程序，即从“test_aaaaa”到“test_zzzzz”，实际上会耗尽构建表的内存。下面是输出。'No hay memoria para insertar XXXX (insertadas XXX)' 的意思是'没有内存可以插入 XXX (XXX 插入)'。基本上 malloc() 在那一点上失败了。

    没有干草记忆 para insertar 'test_epjcv' (insertadas 2097701)。

    Buscando Colisiones...

    ...451 'colision' strings...

    Total de Colisiones: 451
    Total de Palabras  : 2097701

Which means just 451 collisions on 2,097,701 strings. Note that in none of the occasions, there were more than 2 collisions per code. Which I confirm it is a great hash for me, as what I need is to convert the login ID to a 40 bit unique id for indexing. So I use this to convert the login credentials to a 32 bit hash and use the extra 8 bits to handle up to 255 collisions per code, which lookign at the test results would be almost impossible to generate.

Hope this is useful to someone.

EDIT:

Like the test box is AIX, I run it using LDR_CNTRL=MAXDATA=0x20000000 to give it more memory and it run longer, the results are here:

Buscando Colisiones... Total de Colisiones: 2908 Total de Palabras : 5366384

That is 2908 after 5,366,384 tries!!

VERY IMPORTANT: Compiling the program with -maix64 (so unsigned long is 64 bits), the number of collisions is 0 for all cases!!!

score 3 · Accepted Answer

Hsieh散列函数非常好，并且有一些基准/比较，作为 C 中的一般散列函数。根据您想要的（这并不完全明显），您可能想要考虑类似cdb的东西。

score 3 · Accepted Answer

为什么不直接使用Boost 库？它们的散列函数使用简单，Boost 中的大部分内容很快就会成为 C++ 标准的一部分。其中一些已经是。

Boost hash 就像

#include <boost/functional/hash.hpp>

int main()
{
    boost::hash<std::string> string_hash;

    std::size_t h = string_hash("Hash me");
}

您可以在boost.org上找到提升

score 3 · Accepted Answer

3

Bob Jenkins 有许多可用的哈希函数，所有这些函数都很快并且碰撞率低。

于 2008-12-16T21:30:58.723 回答

score 2 · Accepted Answer

2

看看 GNU gperf。

于 2008-09-22T10:06:20.597 回答

score 2 · Accepted Answer

在上一个问题中有一些很好的讨论

以及如何选择散列函数的一个很好的概述，以及关于这里几个常见函数分布的统计信息

score 2 · Accepted Answer

您可以使用 Reflector 查看 .NET 在 String.GetHashCode() 方法上的使用情况。

我会冒险猜测微软花了相当多的时间来优化这个。他们也打印在所有 MSDN 文档中，因此它随时可能发生变化。很明显，这是在他们的“性能调整雷达”上；-)

我想移植到 C++ 也很简单。

score 0 · Accepted Answer

Described here is a simple way of implementing it yourself: http://www.devcodenote.com/2015/04/collision-free-string-hashing.html

A snippet from the post:

if say we have a character set of capital English letters, then the length of the character set is 26 where A could be represented by the number 0, B by the number 1, C by the number 2 and so on till Z by the number 25. Now, whenever we want to map a string of this character set to a unique number , we perform the same conversion as we did in case of the binary format

score -3 · Accepted Answer

-3

CRC-32。谷歌上有大约一万亿个链接。

于 2008-09-22T10:06:47.397 回答

c++ - 具有低冲突率的 32 位整数的快速字符串散列算法

14 回答 14

Related

Reference