database - 数据库中的生日悖论（计算碰撞概率）

Question

根据Birthday paradox：

如果我将它应用于数据库（如果我错了请纠正我）：如果我们需要UNIQUE在数据库中存储散列数据并且我们有一个可以生成 365 个唯一散列值的散列算法，则有 50% 的机会发生数据冲突将在前 23 个数据条目后发生，并且在前 75 个数据库条目后发生 99.9%(!) 的冲突几率。

我们的算法可以生成的唯一哈希的数量和数据条目的数量可以呈指数增长，但冲突的概率将保持不变。如果这对吗？

我有一张巨大的交易表（用于电子商务），并且我将“收据”字段设置为唯一。实际的收据号码让我很困扰。

收据编号示例：BHF2Z47E仅大写 AZ/0-9，长度为 8 个符号。

更新：

生日悖论

score 1 · Accepted Answer

The birthday paradox just states that if you randomly generate values in a space of n, there is a rapid phase transition from no collisions to collisions when you store sqrt(n) values - that's where the probability increases to over 50%.

In your example you have an alphabet of 26 + 10 characters, and 8 digits; so that's 36^8 or about 2.8 trillion possible keys; you can expect to have more than a 50% collision probability after about 1.6 million entries; that's not very good. There is a decent chance of collision at even a fraction of that.

As a comparison, let's say you generated a 160-bit random key for each receipt (2^160 possible values); then you would need to generate about 2^80 receipts (around 10^24) to reach the same probability of collision. You could sell your product as a very large company for your whole life and probably still not see any. Another perspective is that your hard disk or computer will fail before you observe a collision.

The table in this article gives some concrete numbers for you. For example, with a 256-bit hash value and 10^31 values inserted, you would get a collision probability of 10^-15. According to that article, that is around the uncorrectable error rate of your hard disk. That's probably the magnitude of what you should be aiming for with your receipts to avoid overwriting them. It's not to hard to make the values just a little bigger.

Of course, this depends on the fact that you seed your PRNG properly with random data; otherwise you could get the same key easily :)

database - 数据库中的生日悖论（计算碰撞概率）

1 回答 1

Related

Reference