c - MurmurHash2 unsigned integer overflow

Question

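I'm currently trying to implement a hash table/trie. When I pass arguments to MurmurHash2 it does return a number, but I also get runtime errors about unsigned int overflow:

test.c:53:12: runtime error: unsigned integer overflow: 24930 * 1540483477 cannot be represented in type 'unsigned int'

test.c:60:4: runtime error: unsigned integer overflow: 2950274797 * 1540483477 cannot be represented in type 'unsigned int'
6265

I marked lines 53 and 60 with a row of asterisks (*).

I'm not sure whether I'm passing the wrong arguments. Any help would be greatly appreciated!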

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

unsigned int MurmurHash2 ( const void * key, int len, unsigned int seed );

int main(void)
{
   const char* s= "aa";
   unsigned int number= MurmurHash2 (s, (int)strlen(s), 1) % 10000;
   printf("%u\n", number);
}

unsigned int MurmurHash2 ( const void * key, int len, unsigned int seed )
{
// 'm' and 'r' are mixing constants generated offline.
// They're not really 'magic', they just happen to work well.

const unsigned int m = 0x5bd1e995;
const int r = 24;

// Initialize the hash to a 'random' value

unsigned int h = seed ^ len;

// Mix 4 bytes at a time into the hash

const unsigned char * data = (const unsigned char *)key;

while(len >= 4)
{
    unsigned int k = *(unsigned int *)data;

    k *= m;
    k ^= k >> r;
    k *= m;

    h *= m;
    h ^= k;

    data += 4;
    len -= 4;
}

// Handle the last few bytes of the input array

switch(len)
{
case 3: h ^= data[2] << 16;
case 2: h ^= data[1] << 8;
case 1: h ^= data[0];
        h *= m; /* ************************************************ */
};

// Do a few final mixes of the hash to ensure the last few
// bytes are well-incorporated.

h ^= h >> 13;
h *= m;   /* ************************************** */
h ^= h >> 15;

return h;
}

score 4 · Accepted Answer

It looks like you're building with the UBSan option -fsanitize=unsigned-integer-overflow, or with another option such as -fsanitize=integer that enables this check. The documentation says:

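Note that unlike signed integer overflow, unsigned integer overflow is not undefined behavior. However, while it has well-defined semantics, it is often unintentional, so UBSan offers to catch it.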

In MurmurHash, the unsigned integer overflow in the multiplications is entirely intentional, so you should disable this check.

If you pass -fsanitize=unsigned-integer-overflow explicitly, remove it.
If the check is enabled by another option, also pass -fno-sanitize=unsigned-integer-overflow.
Alternatively, annotate the MurmurHash2 function with __attribute__((no_sanitize("unsigned-integer-overflow"))).
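
As a rough illustration of the third option, the sketch below annotates a single mixing step rather than the whole hash. `NO_SANITIZE_UNSIGNED_WRAP` and `mix_step` are hypothetical names introduced here, not part of MurmurHash or any library. The macro guard assumes the attribute string is only meaningful to Clang, and the code assumes a platform where `int` is 32 bits, so that `uint32_t` arithmetic wraps modulo 2^32.

```c
#include <stdint.h>

/* Hypothetical helper macro: expands to the Clang attribute that switches
   off the unsigned-overflow check for one function, and to nothing on
   other compilers. */
#if defined(__clang__)
#define NO_SANITIZE_UNSIGNED_WRAP \
    __attribute__((no_sanitize("unsigned-integer-overflow")))
#else
#define NO_SANITIZE_UNSIGNED_WRAP
#endif

/* Hypothetical function: one multiply-and-shift step in the style of the
   final mixing in MurmurHash2. The multiplication is meant to wrap. */
NO_SANITIZE_UNSIGNED_WRAP
uint32_t mix_step(uint32_t h)
{
    const uint32_t m = 0x5bd1e995u;

    h *= m;        /* wraps modulo 2^32 by design */
    h ^= h >> 15;
    return h;
}
```

With the attribute in place, the rest of the program can still be built with the check enabled, so unintended wraparound elsewhere would still be reported.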

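One more note: your code appears to be copied from the 32-bit reference implementation of MurmurHash2, which assumes that int is 32 bits wide. You should consider using uint32_t instead.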
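
A minimal sketch of what the uint32_t suggestion could look like, based on the code in the question. The name `murmur_hash2_u32`, the `size_t` length parameter, and the use of `memcpy` for the 4-byte load are choices made for this sketch, not part of the reference implementation. Like the original, it reads words in host byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the question's 32-bit MurmurHash2 using fixed-width types.
   The function name and the size_t length are illustrative choices. */
uint32_t murmur_hash2_u32(const void *key, size_t len, uint32_t seed)
{
    const uint32_t m = 0x5bd1e995u;
    const int r = 24;
    const unsigned char *data = (const unsigned char *)key;

    uint32_t h = seed ^ (uint32_t)len;

    /* Mix 4 bytes at a time into the hash. */
    while (len >= 4) {
        uint32_t k;
        memcpy(&k, data, sizeof k);  /* safe for unaligned input */

        k *= m;
        k ^= k >> r;
        k *= m;

        h *= m;
        h ^= k;

        data += 4;
        len -= 4;
    }

    /* Handle the last few bytes; the cases fall through on purpose. */
    switch (len) {
    case 3: h ^= (uint32_t)data[2] << 16; /* fall through */
    case 2: h ^= (uint32_t)data[1] << 8;  /* fall through */
    case 1: h ^= data[0];
            h *= m;
    }

    /* Final mixing so the last bytes are well incorporated. */
    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;

    return h;
}
```

For the input from the question, `murmur_hash2_u32("aa", 2, 1) % 10000` gives 6265, the same value the original program printed.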

score 0 · Accepted Answer

The number of bits in unsigned int is system dependent.

On most systems it is 32 bits (4 bytes), but some systems may use a different size (for example, 64 bits (8 bytes) on some machines).

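However, the MurmurHash "word" has a specific size. The 64-bit variant requires a 64-bit unsigned type, and the 32-bit variant requires a 32-bit unsigned type.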

This mismatch can be resolved by using the uint64_t or uint32_t types defined in <stdint.h>.

I would add that the suffix UL (unsigned long) should probably be added to any numeric constants you use, i.e. 2950274797UL * 1540483477UL.

As @nwellnhof pointed out, your code appears to use the 32-bit variant of the algorithm.

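In these cases, overflow in the multiplication is normal: the result needs more bits than are available and is truncated. This loss of data is acceptable as part of the hashing process.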

Consider telling the compiler about the expected result by writing:

 h = (uint32_t)(((uint64_t)h * m) & 0xFFFFFFFF)
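
To show that the widened form produces the same value as the plain wrapping multiplication, here is a small sketch. `mul_wrap` and `mul_wide` are hypothetical helper names, and the code assumes a platform where `int` is 32 bits, so `uint32_t` operands are not promoted to a signed type.

```c
#include <stdint.h>

/* Plain 32-bit multiplication: the result wraps modulo 2^32. This is
   well defined, but the unsigned-overflow check still reports it. */
uint32_t mul_wrap(uint32_t h, uint32_t m)
{
    return h * m;
}

/* Widened multiplication: the 64-bit product of two 32-bit values always
   fits in 64 bits, so nothing overflows; the mask keeps the low 32 bits. */
uint32_t mul_wide(uint32_t h, uint32_t m)
{
    return (uint32_t)(((uint64_t)h * m) & 0xFFFFFFFFu);
}
```

With the operands from the two error messages, both helpers return 2950488074 for 24930 * 1540483477 and 1774990577 for 2950274797 * 1540483477.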

Good luck!
