我看过Murmur3和Meow,但在散列长数组时,它们似乎都针对带宽进行了优化。我没有任何数组,我只有uint32_t
输入整数。我的输入是小的非负数,通常在几百万以下,都可以被 3 整除。对于某个整数 N,概率密度在 [0 .. N*3] 范围内是均匀的。
以下代码在性能和分发质量方面是否足够好?
// Count of bits in the complete Bloom filter
// Using 32kb or memory to hopefully stay in L1D cache
constexpr size_t countBits = 1024 * 32 * 8;
// These bytes are from random.org
// Cryptographic strength is irrelevant, we only need performance here.
static const __m128i seed = _mm_setr_epi8( (char)0x8a, (char)0xf4, 0x30, (char)0x91, (char)0x07, 0x05, 0x45, (char)0x99, 0x2f, (char)0x95, 0x4a, (char)0xa2, (char)0x84, (char)0x88, (char)0xe6, (char)0x09 );
// Mask to extract lower bits from the hash
static const __m128i andMask = _mm_set1_epi32( countBits - 1 );
// Compute 16 bytes of hash function, mask upper bits in every 32-bit lane, computing 4 bit positions for the Bloom filter.
inline __m128i hash( uint32_t key )
{
// Broadcast integer to 32-bit lanes
__m128i a = _mm_set1_epi32( (int)key );
// Abuse AES-NI decrypting instruction to make a 16-byte hash.
// That thing only takes 4 cycles of latency to complete.
a = _mm_aesdec_si128( a, seed );
// Mask upper bits in the lanes so the _mm_extract_epi32 returns position of bits in the Bloom filter to set or test
a = _mm_and_si128( a, andMask );
return a;
}
更新:这个问题不是基于意见的,因为散列冲突的计数是微不足道的。正是出于这个原因,我指定了输入值的分布。