c++ - C++ std::unordered_map 中使用的默认哈希函数是什么？

Question

我在用

unordered_map<string, int>

和

unordered_map<int, int>

每种情况下使用什么哈希函数，每种情况下发生冲突的可能性是多少？我将在每种情况下分别插入唯一字符串和唯一 int 作为键。

我有兴趣了解在字符串和 int 键及其冲突统计的情况下的哈希函数算法。

score 123 · Accepted Answer

使用函数对象std::hash<>。

所有内置类型以及一些其他标准库类型（例如std::string和）都存在标准特化std::thread。查看完整列表的链接。

对于要在 a 中使用的其他类型std::unordered_map，您必须专门std::hash<>化或创建自己的函数对象。

碰撞的可能性完全取决于实现，但考虑到整数被限制在定义的范围内，而字符串理论上无限长，我想说与字符串发生碰撞的机会要大得多。

至于在 GCC 中的实现，内置类型的特化只是返回位模式。以下是它们的定义方式bits/functional_hash.h：

  /// Partial specializations for pointer types.
  template<typename _Tp>
    struct hash<_Tp*> : public __hash_base<size_t, _Tp*>
    {
      size_t
      operator()(_Tp* __p) const noexcept
      { return reinterpret_cast<size_t>(__p); }
    };

  // Explicit specializations for integer types.
#define _Cxx_hashtable_define_trivial_hash(_Tp)     \
  template<>                        \
    struct hash<_Tp> : public __hash_base<size_t, _Tp>  \
    {                                                   \
      size_t                                            \
      operator()(_Tp __val) const noexcept              \
      { return static_cast<size_t>(__val); }            \
    };

  /// Explicit specialization for bool.
  _Cxx_hashtable_define_trivial_hash(bool)

  /// Explicit specialization for char.
  _Cxx_hashtable_define_trivial_hash(char)

  /// ...

的专业化std::string定义为：

#ifndef _GLIBCXX_COMPATIBILITY_CXX0X
  /// std::hash specialization for string.
  template<>
    struct hash<string>
    : public __hash_base<size_t, string>
    {
      size_t
      operator()(const string& __s) const noexcept
      { return std::_Hash_impl::hash(__s.data(), __s.length()); }
    };

一些进一步的搜索导致我们：

struct _Hash_impl
{
  static size_t
  hash(const void* __ptr, size_t __clength,
       size_t __seed = static_cast<size_t>(0xc70f6907UL))
  { return _Hash_bytes(__ptr, __clength, __seed); }
  ...
};
...
// Hash function implementation for the nontrivial specialization.
// All of them are based on a primitive that hashes a pointer to a
// byte array. The actual hash algorithm is not guaranteed to stay
// the same from release to release -- it may be updated or tuned to
// improve hash quality or speed.
size_t
_Hash_bytes(const void* __ptr, size_t __len, size_t __seed);

_Hash_bytes是来自的外部函数libstdc++。更多的搜索使我找到了这个文件，其中指出：

// This file defines Hash_bytes, a primitive used for defining hash
// functions. Based on public domain MurmurHashUnaligned2, by Austin
// Appleby.  http://murmurhash.googlepages.com/

所以 GCC 对字符串使用的默认散列算法是 MurmurHashUnaligned2。

score 16 · Accepted Answer

GCC C++11 使用 Austin Appleby 的“MurmurHashUnaligned2”

尽管散列算法依赖于编译器，但我将针对 GCC C++11 介绍它。@Avidan Borisov 敏锐地发现用于字符串的 GCC 散列算法是 Austin Appleby 的“MurmurHashUnaligned2”。我做了一些搜索，在 Github 上找到了 GCC 的镜像副本。所以：

用于unordered_map（哈希表模板）和unordered_set（哈希集模板）的 GCC C++11 哈希函数如下所示。

感谢 Avidan Borisov的背景研究，他在关于使用什么是 GCC C++11 哈希函数的问题上指出 GCC 使用了 Austin Appleby 的“MurmurHashUnaligned2”的实现（参见http://murmurhash.googlepages.com /和https://github.com/aappleby/smhasher）。
在文件“gcc/libstdc++-v3/libsupc++/hash_bytes.cc”中，在这里（https://github.com/gcc-mirror/gcc/blob/master/libstdc++-v3/libsupc++/hash_bytes.cc），我发现实施。例如，这是“32 位 size_t”返回值的那个（2017 年 8 月 11 日拉取）

代码：

// Implementation of Murmur hash for 32-bit size_t.
size_t _Hash_bytes(const void* ptr, size_t len, size_t seed)
{
  const size_t m = 0x5bd1e995;
  size_t hash = seed ^ len;
  const char* buf = static_cast<const char*>(ptr);

  // Mix 4 bytes at a time into the hash.
  while (len >= 4)
  {
    size_t k = unaligned_load(buf);
    k *= m;
    k ^= k >> 24;
    k *= m;
    hash *= m;
    hash ^= k;
    buf += 4;
    len -= 4;
  }

  // Handle the last few bytes of the input array.
  switch (len)
  {
    case 3:
      hash ^= static_cast<unsigned char>(buf[2]) << 16;
      [[gnu::fallthrough]];
    case 2:
      hash ^= static_cast<unsigned char>(buf[1]) << 8;
      [[gnu::fallthrough]];
    case 1:
      hash ^= static_cast<unsigned char>(buf[0]);
      hash *= m;
  };

  // Do a few final mixes of the hash.
  hash ^= hash >> 13;
  hash *= m;
  hash ^= hash >> 15;
  return hash;
}

Austin Appleby 哈希函数的最新版本是“MurmurHash3”，它已发布到公共领域！

奥斯汀在他的自述文件中说：

SMHasher 套件还包括MurmurHash3，这是 MurmurHash 函数系列中的最新版本 - 新版本更快、更健壮，其变体可以在 x86 和 x64 平台上高效地生成 32 位和 128 位哈希值。

对于 MurmurHash3 的源代码，请参见此处：

最棒的是！？它是公共领域的软件。这是正确的！文件的顶部状态：

// MurmurHash3 was written by Austin Appleby, and is placed in the public
// domain. The author hereby disclaims copyright to this source code.

所以，如果你想在你的开源软件、个人项目或专有软件中使用 MurmurHash3，包括用 C 实现你自己的哈希表，那就去吧！

如果您想要构建指令来构建和测试他的 MurmurHash3 代码，我在这里写了一些：https ://github.com/ElectricRCAircraftGuy/smhasher/blob/add_build_instructions/build/README.md 。希望我打开的这个 PR被接受，然后它们最终会出现在他的主仓库中。但是，在那之前，请参阅我的 fork 中的构建说明。

对于其他散列函数，包括`djb2`，以及 K&R 散列函数的 2 个版本...

...（一个显然很糟糕，一个非常好），请在此处查看我的另一个答案：hash function for string。

也可以看看：

https://en.wikipedia.org/wiki/MurmurHash
进一步研究：看看这些哈希函数速度基准：https ://github.com/fredrikwidlund/hash-function-benchmark （感谢@lfmunoz指出这一点）

c++ - C++ std::unordered_map 中使用的默认哈希函数是什么？

2 回答 2

GCC C++11 使用 Austin Appleby 的“MurmurHashUnaligned2”

Austin Appleby 哈希函数的最新版本是“MurmurHash3”，它已发布到公共领域！

对于其他散列函数，包括djb2，以及 K&R 散列函数的 2 个版本...

也可以看看：

Related

Reference

对于其他散列函数，包括`djb2`，以及 K&R 散列函数的 2 个版本...