java - 哈希数组映射树 (HAMT)

Question

我试图了解HAMT的细节。为了理解，我自己用Java实现了一个。我对 Tries 很熟悉，我想我了解了 HAMT 的主要概念。

基本上，

两种类型的节点：

核心价值

Key Value Node:
  K key
  V value

指数

Index Node:
  int bitmap (32 bits)
  Node[] table (max length of 32)

为对象生成 32 位散列。
一次遍历 5 位哈希。(0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-31) 注意：最后一步（第 7 步）只有 2 位。
在每一步中，找到该 5 位整数在位图中的位置。例如integer==5 bitmap==00001
1. 如果该位为 1，则散列的该部分存在。
2. 如果该位为 0，则密钥不存在。
如果键存在，则通过计算位图中 0 和位置之间的 1 的数量来找到它在表中的索引。例如integer==6 bitmap==0101010101 index==3
1. 如果表指向键/值节点，则比较键。
2. 如果表指向一个索引节点，则转到 2 前进一步。

我不太了解的部分是碰撞检测和缓解。在链接的论文中，他提到了这一点：

然后将现有密钥插入新的子哈希表并添加新密钥。每多使用 5 位散列，冲突的概率就会降低 1/32。有时可能会消耗一个完整的 32 位散列，并且必须计算一个新的散列来区分这两个键。

如果我要计算一个“新”散列并将对象存储在该新散列中；你怎么能在结构中查找对象？在进行查找时，它不会生成“初始”哈希而不是“重新计算的哈希”。

我肯定错过了什么.....

顺便说一句：HAMT 表现相当不错，在我的测试中它位于哈希图和树图之间。

Data Structure                    Add time   Remove time     Sorted add time Sorted remove time   Lookup time     Size     
Java's Hash Map                   38.67 ms   18 ms           30 ms           15 ms                16.33 ms        8.19 MB        
HAMT                              68.67 ms   30 ms           57.33 ms        35.33 ms             33 ms           10.18 MB       
Java's Tree Map                   86.33 ms   78 ms           75 ms           39.33 ms             76 ms           8.79 MB        
Java's Skip List Map              111.33 ms  106 ms          72 ms           41 ms                72.33 ms        9.19 MB

score 7 · Accepted Answer

HAMT 是一种出色且高性能的结构，尤其是当需要不可变对象时，即每次修改后都会创建一个数据结构的新副本！

至于您关于哈希冲突的问题，我发现了一个 C#实现（现在有问题），它显示了它是如何工作的：在每次哈希冲突时，都会重新哈希键并递归重试查找，直到达到最大迭代限制。

目前，我还在函数式编程上下文中探索 HAMP 并学习现有代码。在Haskell 中有几个 HAMT 的参考实现是 Data.HshMap ，在Clojure 中是 PersistenceHashMap。

网络上还有一些其他更简单的实现不处理冲突，但它们有助于理解这个概念。他们在Haskell和OCaml中

我找到了一篇很好的总结文章，它用图片和Phil Bagwell的原始研究论文的链接描述了 HAMT 。

相关点：

在 F# 中实现 HAMT 时，我注意到此处描述的 popCount 函数实现非常重要，与链接中下一个答案中描述的幼稚实现相比，它提供了 10-15%。不是很好，但免费午餐。

当键可以是整数并且它们实现相关的PATRICIA/Radix trie时，相关的 IntMap 结构（Haskell及其到 F# 的端口）非常好。

我相信所有这些实现都非常适合在这些示例中学习高效的不可变数据结构和函数式语言——它们真的很相配！

score 7 · Accepted Answer

There's two sections of the paper I think you might of missed. The first is the bit immediately preceding the bit you quoted:

Or the key will collide with an existing one. In which case the existing key must be replaced with a sub-hash table and the next 5 bit hash of the existing key computed. If there is still a collision then this process is repeated until no collision occurs.

So if you have object A in the table and you add object B which clashes, the cell at which their keys clashed will be a pointer to another subtable (where they don't clash).

Next, Section 3.7 of the paper you linked describes the method for generating a new hash when you run off the end of your first 32 bits:

The hash function was tailored to give a 32 bit hash. The algorithm requires that the hash can be extended to an arbitrary number of bits. This was accomplished by rehashing the key combined with an integer representing the trie level, zero being the root. Hence if two keys do give the same initial hash then the rehash has a probability of 1 in 2^32 of a further collision.

If this doesn't seem to explain anything, say and I'll extend this answer with more detail.

score 1 · Accepted Answer

如果我要计算一个“新”散列并将对象存储在该新散列中；你怎么能在结构中查找对象？在进行查找时，它不会生成“初始”哈希而不是“重新计算的哈希”。

在进行查找时，会使用初始哈希。当初始哈希中的位用尽时，以下任一条件为真：

我们最终得到一个键/值节点——返回它
我们最终得到了一个索引节点——这暗示我们必须通过重新计算一个新的哈希来更深入。

这里的关键是哈希位耗尽。

score 0 · Accepted Answer

碰撞的可能性可能非常低，通常只对大树有问题。鉴于此，您最好将冲突存储在叶子的数组中并线性搜索（我在 C# HAMT 中执行此操作）。

java - 哈希数组映射树 (HAMT)

4 回答 4

Related

Reference