
I'm using hash_ring to distribute objects among servers. I assumed the distribution would be uniform, since it's based on MD5 hashing. Unfortunately, that turns out not to be the case.

The keys are random, generated with uuid.uuid4(). I have verified that MD5 itself does produce a uniform distribution. However, when I distribute with hash_ring.HashRing, there is a 20-30% difference between the most and least populated buckets.

  • Can hash_ring's distribution be made more uniform by tweaking some setting?
  • Are there other good options for consistent hashing in Python?

The code I used to test the uniformity of the distribution:

import uuid
import hash_ring

ring = hash_ring.HashRing(range(8))  # 8 nodes, identified by the integers 0-7
for _ in range(10):
    counters = [0] * 8
    for _ in range(100000):
        counters[ring.get_node(str(uuid.uuid4()))] += 1
    print counters

It prints:

[11115, 11853, 14033, 11505, 13640, 12292, 12851, 12711]
[11164, 11833, 14024, 11562, 13365, 12302, 13002, 12748]
[11354, 11756, 14017, 11583, 13201, 12231, 13135, 12723]
[11182, 11672, 13936, 11441, 13563, 12240, 13129, 12837]
[11069, 11866, 14045, 11541, 13443, 12249, 12894, 12893]
[11196, 11791, 14158, 11533, 13517, 12319, 13039, 12447]
[11297, 11944, 14114, 11536, 13154, 12289, 12990, 12676]
[11250, 11770, 14145, 11657, 13340, 11960, 13161, 12717]
[11027, 11891, 14099, 11615, 13320, 12336, 12891, 12821]
[11148, 11677, 13965, 11557, 13503, 12309, 13123, 12718]

For comparison, I did direct (non-consistent) hashing with MD5:

import hashlib
import uuid

for _ in range(10):
    counters = [0] * 8
    for _ in range(100000):
        counters[int(hashlib.md5(str(uuid.uuid4())).hexdigest(), 16) % 8] += 1
    print counters

The results are much better:

[12450, 12501, 12380, 12643, 12446, 12444, 12506, 12630]
[12579, 12667, 12523, 12385, 12386, 12445, 12702, 12313]
[12624, 12449, 12451, 12401, 12580, 12449, 12562, 12484]
[12359, 12543, 12371, 12659, 12508, 12416, 12619, 12525]
[12425, 12526, 12565, 12732, 12381, 12481, 12335, 12555]
[12514, 12576, 12528, 12294, 12658, 12319, 12518, 12593]
[12500, 12471, 12460, 12502, 12637, 12393, 12442, 12595]
[12583, 12418, 12428, 12311, 12581, 12780, 12372, 12527]
[12395, 12569, 12544, 12319, 12607, 12488, 12424, 12654]
[12480, 12423, 12492, 12433, 12427, 12502, 12635, 12608]

1 Answer


The hash ring sacrifices the "evenness" of your MD5 test code in order to maintain mappings when the number of entries changes; see http://www.lexemetech.com/2007/11/consistent-hashing.html. So the differences you see are not caused by uuid4, or by a bug, but by the library using a different algorithm from your test.

If you changed the number of bins in your MD5 code, you'd need to change the modular division (the % 8), and suddenly (almost) all mappings would change. Consistent hashing avoids this, which is why it can't use the same "obviously uniform" approach that you do. The sketch below makes the difference concrete.
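
To see this, here is a small sketch (plain Python, independent of hash_ring) that counts how many keys land in a different bucket when the modulus changes from 8 to 9; the key count and hashing scheme mirror the test code from the question:

import hashlib
import uuid

def md5_bucket(key, num_bins):
    # The non-consistent scheme from the question: MD5, then modular division.
    return int(hashlib.md5(key).hexdigest(), 16) % num_bins

keys = [str(uuid.uuid4()) for _ in range(100000)]
moved = sum(1 for k in keys if md5_bucket(k, 8) != md5_bucket(k, 9))
# With modular hashing, adding a ninth bin remaps roughly 8/9 of all keys;
# a consistent hash would move only about 1/9 of them.
print '%.1f%% of keys changed buckets' % (100.0 * moved / len(keys))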

The downside of the consistent approach is that it's not perfectly uniform (it depends on the hashes of the bins, rather than on the hashes of the objects you put in the bins, so you don't get the "evening out" you would expect as you add more objects). But you can increase uniformity by increasing the number of points on the ring, i.e. by increasing the number of "replicas" used in the code; see the sketch below.
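
To illustrate why more points help (this is a minimal toy ring, not the hash_ring implementation itself), watch the bucket counts tighten up as the number of points per node grows:

import bisect
import hashlib
import uuid

def build_ring(nodes, replicas):
    # Place `replicas` points per node on the ring, positioned by MD5.
    points = sorted((int(hashlib.md5('%s:%d' % (node, i)).hexdigest(), 16), node)
                    for node in nodes for i in range(replicas))
    return [h for h, _ in points], [node for _, node in points]

def get_node(hashes, owners, key):
    # A key belongs to the first ring point at or after its own hash.
    h = int(hashlib.md5(key).hexdigest(), 16)
    return owners[bisect.bisect(hashes, h) % len(hashes)]

keys = [str(uuid.uuid4()) for _ in range(100000)]
for replicas in (3, 40, 160):
    hashes, owners = build_ring(range(8), replicas)
    counters = [0] * 8
    for k in keys:
        counters[get_node(hashes, owners, k)] += 1
    print replicas, counters

With only a few points per node the buckets vary widely; with a few hundred, the counts should settle close to the 12,500 average.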

So, assuming the final API matches the one shown at http://amix.dk/blog/viewEntry/19367, just set a larger value for replicas (try doubling it, or even just adding 1; your distribution is already pretty flat).
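
For instance, if the constructor accepts a replicas argument as in the blog-post version of the code (an assumption; the released package may size the ring differently), the question's test would become:

import uuid
import hash_ring

# `replicas=8` assumes the blog-post API; check your installed version.
ring = hash_ring.HashRing(range(8), replicas=8)
counters = [0] * 8
for _ in range(100000):
    counters[ring.get_node(str(uuid.uuid4()))] += 1
print counters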


Update: I looked around a bit more, and this blog post http://amix.dk/blog/post/19369 describes the changes made to the latest code. It looks like it does use more replicas than 3 (I don't completely understand the code, sorry), and it also seems that it's now based on a well-known "standard" implementation.

Answered 2011-08-22 at 18:23