3

我自己实现了HyperLogLog algorithm. 它工作得很好,但有时我必须获取很多(大约 10k-100k)的 HLL 结构并将它们合并。

我将它们中的每一个都存储为一个位字符串,所以首先我必须将每个位字符串转换为存储桶。由于有很多 HLL,它需要的时间比我想要的要多。

目前,大约 80% 的运行时会为每个 HLL 调用一次这行代码:

my @buckets = map { oct '0b'.$_ } unpack('(a5)1024', $bitstring);

有什么方法可以更快地做到这一点?

如果我们将 HyperLogLog 的定义抛在脑后,任务可以这样解释:假设它$bitstring由 1024 个 5 位计数器组成(因此每个计数器的值最高可达 32),我们必须将其转换为 1024 个整数的数组。

4

1 回答 1

6

The a denotes arbitrary, zero-padded binary data. Here, you treat that data as ASCII text but it can only contain 1 and 0! This is inefficient as a5 ends up using five bytes. The easiest and most efficient solution would be to store a 8-bit number for each counter, then: my @buckets = unpack 'C1024', $bitstring.

If you only want to store five bits per counter, you end up saving very little memory for very much hassle. You'd have to use something insane like this for a round-trip conversion:

my $bitstring = pack "(b5)1024", map { sprintf "%b", $_ } @buckets;
@buckets = map { oct "0b$_" } unpack "(b5)1024", $bitstring;
于 2014-05-15T19:21:58.857 回答