I'm implementing a Pearson hash to build a lightweight dictionary structure for a C project that needs a table of filenames paired with file data; I want the good constant-time lookup properties of a hash table. I'm no maths expert, so I looked up good text hashing algorithms and Pearson came up, reportedly effective and well-distributed. I tested my implementation and found that no matter how I vary the table size or the maximum filename length, the hash is remarkably inefficient, with e.g. 18/50 buckets left empty. I trust that Wikipedia isn't lying, and yes, I know I could just download a third-party hash table implementation, but I'd really like to know why my own version doesn't work.
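For reference, this is the core loop as I understand it from the Wikipedia pseudocode (a minimal sketch just for this question; the names PearsonRef and T are mine, not from my project):

#include <stddef.h>

// Reference Pearson hash as I read the Wikipedia description: start at 0
// and run every input byte through the 256-entry permutation table T.
unsigned char PearsonRef(const unsigned char* pData, size_t nLen, const unsigned char* T) {
    unsigned char h = 0;
    for (size_t i = 0; i < nLen; ++i)
        h = T[h ^ pData[i]];
    return h;
}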
In the code below (the function that inserts a value into the table), "csString" is the filename, i.e. the string to be hashed, "cLen" is the string's length, "pData" is a pointer to some data to insert into the table, and "pTable" is the table structure. The initial condition cHash = cLen - csString[0] is something I found through experimentation that slightly improves uniformity. I should add that I'm testing the table with completely random strings (ASCII values generated with rand()) of random lengths within a certain range; this makes it easy to generate and test a large number of values.
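The test strings are produced roughly like this (a sketch of my generator; the exact character and length ranges here are illustrative, not my precise settings):

#include <stdlib.h>

// Fill pOut with nLen random printable ASCII characters (no terminator needed here).
static void RandomTestString(char* pOut, unsigned char nLen) {
    for (unsigned char i = 0; i < nLen; ++i)
        pOut[i] = (char)(32 + rand() % 95); // printable ASCII 32..126
}

// Pick a random length between nMin and nMax inclusive.
static unsigned char RandomTestLength(unsigned char nMin, unsigned char nMax) {
    return (unsigned char)(nMin + rand() % (nMax - nMin + 1));
}

The table code itself follows: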
typedef struct StaticStrTable {
    unsigned int nRepeats;        // counts inserts that hit an already-occupied slot
    unsigned char nBuckets;
    unsigned char nMaxCollisions; // slots available per bucket
    void** pBuckets;
} StaticStrTable;
static const unsigned char cPerm256[256] = {
227, 117, 238, 33, 25, 165, 107, 226, 132, 88, 84, 68, 217, 237, 228, 58, 52, 147, 46, 197, 191, 119, 211, 0, 218, 139, 196, 153, 170, 77, 175, 22, 193, 83, 66, 182, 151, 99, 11, 144, 104, 233, 166, 34, 177, 14, 194, 51, 30, 121, 102, 49,
222, 210, 199, 122, 235, 72, 13, 156, 38, 145, 137, 78, 65, 176, 94, 163, 95, 59, 92, 114, 243, 204, 224, 43, 185, 168, 244, 203, 28, 124, 248, 105, 10, 87, 115, 161, 138, 223, 108, 192, 6, 186, 101, 16, 39, 134, 123, 200, 190, 195, 178,
164, 9, 251, 245, 73, 162, 71, 7, 239, 62, 69, 209, 159, 3, 45, 247, 19, 174, 149, 61, 57, 146, 234, 189, 15, 202, 89, 111, 207, 31, 127, 215, 198, 231, 4, 181, 154, 64, 125, 24, 93, 152, 37, 116, 160, 113, 169, 255, 44, 36, 70, 225, 79,
250, 12, 229, 230, 76, 167, 118, 232, 142, 212, 98, 82, 252, 130, 23, 29, 236, 86, 240, 32, 90, 67, 126, 8, 133, 85, 20, 63, 47, 150, 135, 100, 103, 173, 184, 48, 143, 42, 54, 129, 242, 18, 187, 106, 254, 53, 120, 205, 155, 216, 219, 172,
21, 253, 5, 221, 40, 27, 2, 179, 74, 17, 55, 183, 56, 50, 110, 201, 109, 249, 128, 112, 75, 220, 214, 140, 246, 213, 136, 148, 97, 35, 241, 60, 188, 180, 206, 80, 91, 96, 157, 81, 171, 141, 131, 158, 1, 208, 26, 41
};
void InsertStaticStrTable(char* csString, unsigned char cLen, void* pData, StaticStrTable* pTable) {
    unsigned char cHash = cLen - csString[0];

    for (int i = 0; i < cLen; ++i) cHash ^= cPerm256[cHash ^ csString[i]];

    unsigned short cTableIndex = cHash % pTable->nBuckets;
    long long* pBucket = pTable->pBuckets[cTableIndex];

    // Inserts the data and records how many collisions there are - it may look weird,
    // as the way I decided to pack the data into the table buffer is very compact and
    // arbitrary. It won't affect the hash though, which is the key issue!
    for (int i = 0; i < pTable->nMaxCollisions; ++i) {
        // Reaching slot 1 means slot 0 was taken, i.e. this insert collided at least once
        if (i == 1) {
            pTable->nRepeats++;
        }

        long long* pSlotID = pBucket + (i << 1); // each slot is two long longs: key, data

        if (pSlotID[0] == 0) {
            pSlotID[0] = (long long)csString; // store the string pointer as the slot key
            pSlotID[1] = (long long)pData;
            break;
        }
    }
}
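And this is roughly how I get figures like "18/50 buckets empty" after filling the table (a sketch for completeness; it just walks each bucket and checks the first slot's key field, matching the packing above):

// Count buckets that never received an entry (first slot's key is still zero).
static unsigned int CountEmptyBuckets(const StaticStrTable* pTable) {
    unsigned int nEmpty = 0;
    for (unsigned char b = 0; b < pTable->nBuckets; ++b) {
        const long long* pBucket = pTable->pBuckets[b];
        if (pBucket[0] == 0) ++nEmpty;
    }
    return nEmpty;
}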