I'm implementing a Pearson hash to build a lightweight dictionary structure for a C project that needs a table of filenames paired with file data; I want the good constant-time lookup properties of a hash table. I'm no maths expert, so I looked up good text hashing algorithms and Pearson came up, reportedly effective and well-distributed. I tested my implementation and found that no matter how I vary the table size or the maximum filename length, the hash is remarkably inefficient, with e.g. 18/50 buckets left empty. I trust that Wikipedia isn't lying, and yes, I know I could just download a third-party hash table implementation, but I'd really like to know why my own version doesn't work.
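For reference, this is the core loop as I understand it from the Wikipedia pseudocode (a minimal sketch just for this question; the names PearsonRef and T are mine, not from my project):

#include <stddef.h>

// Reference Pearson hash as I read the Wikipedia description: start at 0
// and run every input byte through the 256-entry permutation table T.
unsigned char PearsonRef(const unsigned char* pData, size_t nLen, const unsigned char* T) {
    unsigned char h = 0;
    for (size_t i = 0; i < nLen; ++i)
        h = T[h ^ pData[i]];
    return h;
}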
In the code below (the function that inserts a value into the table), "csString" is the filename, i.e. the string to be hashed, "cLen" is the string's length, "pData" is a pointer to some data to insert into the table, and "pTable" is the table structure. The initial condition cHash = cLen - csString[0] is something I found through experimentation that slightly improves uniformity. I should add that I'm testing the table with completely random strings (ASCII values generated with rand()) of random lengths within a certain range; this makes it easy to generate and test a large number of values.
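The test strings are produced roughly like this (a sketch of my generator; the exact character and length ranges here are illustrative, not my precise settings):

#include <stdlib.h>

// Fill pOut with nLen random printable ASCII characters (no terminator needed here).
static void RandomTestString(char* pOut, unsigned char nLen) {
    for (unsigned char i = 0; i < nLen; ++i)
        pOut[i] = (char)(32 + rand() % 95); // printable ASCII 32..126
}

// Pick a random length between nMin and nMax inclusive.
static unsigned char RandomTestLength(unsigned char nMin, unsigned char nMax) {
    return (unsigned char)(nMin + rand() % (nMax - nMin + 1));
}

The table code itself follows: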
typedef struct StaticStrTable {
    unsigned int nRepeats;        // counts inserts that hit an already-occupied slot
    unsigned char nBuckets;
    unsigned char nMaxCollisions; // slots available per bucket
    void** pBuckets;
} StaticStrTable;
static const unsigned char cPerm256[256] = {
227, 117, 238, 33, 25, 165, 107, 226, 132, 88, 84, 68, 217, 237, 228, 58, 52, 147, 46, 197, 191, 119, 211, 0, 218, 139, 196, 153, 170, 77, 175, 22, 193, 83, 66, 182, 151, 99, 11, 144, 104, 233, 166, 34, 177, 14, 194, 51, 30, 121, 102, 49,
222, 210, 199, 122, 235, 72, 13, 156, 38, 145, 137, 78, 65, 176, 94, 163, 95, 59, 92, 114, 243, 204, 224, 43, 185, 168, 244, 203, 28, 124, 248, 105, 10, 87, 115, 161, 138, 223, 108, 192, 6, 186, 101, 16, 39, 134, 123, 200, 190, 195, 178,
164, 9, 251, 245, 73, 162, 71, 7, 239, 62, 69, 209, 159, 3, 45, 247, 19, 174, 149, 61, 57, 146, 234, 189, 15, 202, 89, 111, 207, 31, 127, 215, 198, 231, 4, 181, 154, 64, 125, 24, 93, 152, 37, 116, 160, 113, 169, 255, 44, 36, 70, 225, 79,
250, 12, 229, 230, 76, 167, 118, 232, 142, 212, 98, 82, 252, 130, 23, 29, 236, 86, 240, 32, 90, 67, 126, 8, 133, 85, 20, 63, 47, 150, 135, 100, 103, 173, 184, 48, 143, 42, 54, 129, 242, 18, 187, 106, 254, 53, 120, 205, 155, 216, 219, 172,
21, 253, 5, 221, 40, 27, 2, 179, 74, 17, 55, 183, 56, 50, 110, 201, 109, 249, 128, 112, 75, 220, 214, 140, 246, 213, 136, 148, 97, 35, 241, 60, 188, 180, 206, 80, 91, 96, 157, 81, 171, 141, 131, 158, 1, 208, 26, 41
};
void InsertStaticStrTable(char* csString, unsigned char cLen, void* pData, StaticStrTable* pTable) {
    unsigned char cHash = cLen - csString[0];

    for (int i = 0; i < cLen; ++i) cHash ^= cPerm256[cHash ^ csString[i]];

    unsigned short cTableIndex = cHash % pTable->nBuckets;
    long long* pBucket = pTable->pBuckets[cTableIndex];

    // Inserts the data and records how many collisions there are - it may look weird,
    // as the way I decided to pack the data into the table buffer is very compact and
    // arbitrary. It won't affect the hash though, which is the key issue!
    for (int i = 0; i < pTable->nMaxCollisions; ++i) {
        // Reaching slot 1 means slot 0 was taken, i.e. this insert collided at least once
        if (i == 1) {
            pTable->nRepeats++;
        }

        long long* pSlotID = pBucket + (i << 1); // each slot is two long longs: key, data

        if (pSlotID[0] == 0) {
            pSlotID[0] = (long long)csString; // store the string pointer as the slot key
            pSlotID[1] = (long long)pData;
            break;
        }
    }
}
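And this is roughly how I get figures like "18/50 buckets empty" after filling the table (a sketch for completeness; it just walks each bucket and checks the first slot's key field, matching the packing above):

// Count buckets that never received an entry (first slot's key is still zero).
static unsigned int CountEmptyBuckets(const StaticStrTable* pTable) {
    unsigned int nEmpty = 0;
    for (unsigned char b = 0; b < pTable->nBuckets; ++b) {
        const long long* pBucket = pTable->pBuckets[b];
        if (pBucket[0] == 0) ++nEmpty;
    }
    return nEmpty;
}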