0

有 HyperLogLog 算法,但它相当复杂。

有没有更简单的节省空间的方法可以用几行代码来表达?

4

1 回答 1

2

HyperLogLog 本身非常简单,只要你理解了它(并且已经有了哈希函数)。

我在 Python 中实现了一个教学版本,其中 5 行用于元素插入,另外 6 行用于估计:

from mmhash import mmhash
from math import log
from base64 import b64encode

class HyperLogLog:
    def __init__(self, log2m):
        self.log2m = log2m
        self.m = 1 << log2m
        self.data = [0]*self.m
        self.alphaMM = (0.7213 / (1 + 1.079 / self.m)) * self.m * self.m

    def offer(self, o):
        x = mmhash(str(o), 0)
        a, b = 32-self.log2m, self.log2m
        i = x >> a
        v = self._bitscan(x << b, a)
        self.data[i] = max(self.data[i], v)

    def count(self):
        estimate = self.alphaMM / sum([2**-v for v in self.data])
        if estimate <= 2.5 * self.m:
            zeros = float(self.data.count(0))
            return round(-self.m * log(zeros / self.m))
        else:
            return round(estimate)

    def _bitscan(self, x, m):
        v = 1
        while v<=m and not x&0x80000000:
            v+=1
            x<<=1
        return v

完整的实现可以在这里找到:

https://github.com/juanplopes/hyperloglog.py/blob/master/hyperloglog.py

于 2015-05-18T13:03:16.223 回答