我正在尝试实现 Flajolet Martin 算法。我有一个包含超过 6000 条记录的数据集,但以下代码的输出是 4096。请帮助我理解我所犯的错误。
import xxhash
import math
def return_trailing_zeroes(s):
s = str(s)
rev = s[::-1]
count = 0
for i in rev:
if i is '0':
count = count + 1
else:
break
return count
def gethash(line):
num=abs(xxhash.xxh32(line).intdigest())
return num
fp=open("/content/drive/MyDrive/Data.txt","r")
h_max=0
for line in fp:
hash_value_1 = gethash(line)
binary_1 = format(hash_value_1, '032b')
t1 = return_trailing_zeroes(binary_1)
if t1>h_max:
h_max=t1
fp.close()
print(2**h_max)
我尝试了 HyperLogLog 算法的这种实现,以下代码的输出为 2560。
def return_trailing_zeroes(s): s = str(s) rev = s[::-1] count = 0 for i in rev: if i is '0': count = count + 1 else: break
return counth1_m=0
h2_m=0
h3_m=0
h4_m=0
fp=open("/content/drive/MyDrive/Data.txt","r")
对于 fp 中的行:
hash_value_1 = abs(xxhash.xxh32(line).intdigest())
hash_value_2 = abs(hash32(line))
hash_value_3 = abs(jhashcode.hashcode(line))
hash_value_4 = abs(mmh3.hash(line))
binary_1 = 格式(哈希值_1,'032b')
binary_2 = 格式(哈希值_2,'032b')
binary_3 = 格式(哈希值_3,'032b')
binary_4 = 格式(哈希值_4,'032b')
t1 = return_trailing_zeroes(binary_1)
t2 = return_trailing_zeroes(binary_2)
t3 = return_trailing_zeroes(binary_3)
t4 = return_trailing_zeroes(binary_4)
如果 t1>h1_m:h1_m=t1
如果 t2>h2_m:h2_m=t2
如果 t3>h3_m:h3_m=t3
如果 t4>h4_m:h4_m=t4
fp.close()
avg_hash12 = (2**(h1_m) + 2**(h2_m))/float(2)
avg_hash34 = (2**(h3_m) + 2**(h4_m))/float(2)
distinct_elements = math.ceil(statistics.median([avg_hash12, avg_hash34]))
打印(不同的元素)