hash-collision - Collision rate of a trimmed SHA1 hash

Question

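In my web application I store cache files, with hash-generated file names, in various subdirectories to optimise performance. One way I know I can improve performance is to make sure the generated names follow the 8.3 filename structure, so that NTFS does not have to generate short file names (I can't set this in the registry).

To do that, though, I would have to trim the hash (I'm thinking SHA1) to 8 characters, and obviously that greatly increases the likelihood of collisions. What would the probability of a collision be?

I've seen an answer here about the collision rate of a full SHA1 hash, but my maths is terrible, so working this value out is well beyond me.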

score 5 · Accepted Answer

Since SHA-1's output is uniformly distributed, you can approximate the collision rate using the Birthday Paradox:

If you keep n bits of the SHA-1 output, there is a roughly 50% chance of at least one collision in a set containing about 2^(n/2) records; in other words, as a rule of thumb, your collision rate is approximately 1/2^(n/2).

If you need a more accurate answer, you can always use the formula in the answer you've referenced in your question.
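As a rough illustration of the rule of thumb above, the following sketch evaluates the standard birthday approximation, P ≈ 1 - exp(-k(k-1) / (2 * 2^n)), for k records and n retained bits. The class and method names are made up for this example, and the formula is the usual textbook approximation, not necessarily the one from the referenced answer.

```csharp
using System;

public static class BirthdayBound
{
    // Approximate probability of at least one collision among `records`
    // uniformly random values of `bits` bits:
    // 1 - exp(-k(k-1) / (2 * 2^bits)).
    public static double CollisionProbability(double records, int bits)
    {
        double space = Math.Pow(2, bits);
        return 1.0 - Math.Exp(-records * (records - 1) / (2.0 * space));
    }

    public static void Main()
    {
        foreach (int bits in new[] { 32, 64 })
        {
            // Evaluate the approximation at the 2^(n/2) threshold.
            double records = Math.Pow(2, bits / 2.0);
            double p = CollisionProbability(records, bits);
            Console.WriteLine($"{bits} bits, {records:F0} records: {p:P1}");
        }
    }
}
```

Under this approximation the probability at exactly 2^(n/2) records comes out a little under 40%, and 50% is reached near 1.18 * 2^(n/2) records, so the rule of thumb holds to within a small constant factor.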

So here, if we assume each character is 1 byte (8 bits), you will most likely encounter a collision once you have ~2^(8*8/2) = 4294967296 records (so the collision rate is going to be about 2.33 * 10^-10, or 2.33 * 10^-8 %, which is very small).

As for the collision rate you measured with your test program: the ToSHA1Fingerprint() function returns a hexadecimal string, which means an 8-character substring of it represents only 4 bytes (32 bits). The approximate collision rate based on the above formula is therefore 1/2^(4*8/2) = 0.000015258789, or about 0.0015%.
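To make the hex-versus-bytes point concrete, here is a minimal sketch of what such a helper might look like. ToSha1Hex is a hypothetical stand-in, since the question's actual ToSHA1Fingerprint() implementation is not shown; it assumes the input is hashed as UTF-8 and rendered as lowercase hex.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class FingerprintSketch
{
    // Hypothetical stand-in for the question's ToSHA1Fingerprint():
    // hashes the UTF-8 bytes of the input and returns lowercase hex.
    public static string ToSha1Hex(string input)
    {
        using (SHA1 sha1 = SHA1.Create())
        {
            byte[] hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(input));
            StringBuilder sb = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
            {
                sb.Append(b.ToString("x2")); // two hex characters per byte
            }
            return sb.ToString();
        }
    }

    public static void Main()
    {
        string full = ToSha1Hex("example.tmp");
        string trimmed = full.Substring(0, 8);

        Console.WriteLine(full.Length);        // 40 hex characters = 20 bytes
        Console.WriteLine(trimmed);
        // Each hex character carries 4 bits, so 8 characters carry 32 bits.
        Console.WriteLine(trimmed.Length * 4);
    }
}
```

Because each hex character encodes only 4 bits, an 8-character prefix selects one of 2^32 possible names rather than 2^64.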

score 0 · Accepted Answer

It looks like the collision rate is too high for my needs; I'm getting ~0.004% in tests using the following code.

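// Requires: using System.Collections.Generic; using System.IO; using System.Linq;
// ToSHA1Fingerprint() is a custom extension method (not shown here).
const int Iterations = 10;
const int Maxitems = 360000;

// Collision rate (in percent) measured in each iteration.
List<double> collisions = new List<double>();

for (int i = 0; i < Iterations; i++)
{
    List<string> paths = new List<string>();

    for (int j = 0; j < Maxitems; j++)
    {
        // Hash a random file name and keep the first 8 characters.
        string path = Path.GetRandomFileName().ToSHA1Fingerprint()
                                              .Substring(0, 8);

        paths.Add(path);
    }

    // Every item beyond the distinct count duplicates an earlier one.
    int count = paths.Distinct().Count();

    double collisionRate = ((Maxitems - count) * 100D) / Maxitems;
    collisions.Add(collisionRate);
}

double averageCollisionRate = collisions.Average();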
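For comparison, the figure this test measures (the share of items in a batch that duplicate an earlier item) can be estimated analytically. The sketch below assumes the 8 hex characters behave like 32 uniformly random bits and uses the approximation that k items drawn from N possibilities yield about N * (1 - exp(-k/N)) distinct values; the class and method names are illustrative only.

```csharp
using System;

public static class ExpectedDuplicates
{
    // Expected percentage of items that duplicate an earlier item when
    // `items` uniformly random values are drawn from 2^bits possibilities.
    public static double ExpectedDuplicatePercent(double items, int bits)
    {
        double space = Math.Pow(2, bits);
        // Approximate expected number of distinct values: N * (1 - exp(-k/N)).
        double distinct = space * (1.0 - Math.Exp(-items / space));
        return (items - distinct) * 100.0 / items;
    }

    public static void Main()
    {
        // Same batch size as the test above: 360000 items, 32 bits kept.
        double percent = ExpectedDuplicatePercent(360000, 32);
        Console.WriteLine($"Expected duplicate rate: {percent:F5}%");
    }
}
```

With these assumptions the estimate is roughly 0.0042%, in line with the ~0.004% measured above. The rate is close to k / (2 * 2^32) for batches of this size, so it grows roughly in proportion to the number of files stored.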
