language-agnostic - 在应该是唯一的值上处理几乎不可能的冲突

Question

有许多系统依赖于某些特定值的唯一性。想到任何使用 GUID 的东西（例如 Windows 注册表或其他数据库），还有从对象创建散列以识别它并因此需要此散列唯一的东西。

散列表通常不介意两个对象是否具有相同的散列，因为散列仅用于将对象分解为类别，因此在查找时，不是表中的所有对象，而是同一类别中的那些对象（ bucket) 必须与搜索对象进行身份比较。

然而，其他实现（似乎）取决于唯一性。我的例子（这就是我提出这个问题的原因）是 Mercurial 的修订 ID。Mercurial 邮件列表上的一个条目正确地指出

在您的前十亿次提交中，变更集哈希意外冲突的几率基本上为零。但我们会注意到它是否发生。你会因为意外破坏 SHA1 的人而出名。

但即使是最小的概率也不意味着不可能。现在，我不想解释为什么完全可以依赖唯一性（例如，这里已经讨论过）。这对我来说很清楚。

相反，我想知道（也许通过您自己工作中的示例）：

无论如何，是否有任何最佳实践来涵盖这些不太可能的案例？
是否应该忽略它们，因为特别强烈的太阳风更有可能导致硬盘读取错误？
是否应该至少对它们进行测试，如果只是向用户显示“我放弃，你已经完成了不可能”的消息而失败？
还是应该优雅地处理这些案件？

对我来说，尤其是以下内容很有趣，尽管它们有些敏感：

如果你不处理这些情况，你会如何对抗不听概率的直觉？
如果你确实处理了它们，你如何证明这项工作的合理性（对你自己和其他人），考虑到你没有处理更多可能的情况，比如超非瓦斯？

score 7 · Accepted Answer

If you do handle them, how do you justify this work (to yourself and others), considering there are more probable cases you don't handle, like a supernova?

The answer to that is you aren't testing to spot a GUID collision occurring by chance. You're testing to spot a GUID collision occurring because of a bug in the GUID code, or a precondition that the GUID code relies on that you've violated (or been tricked into violating by some attacker), such as in V1 that MAC addresses are unique and time goes forward. Either is considerably more likely than supernova-based bugs.

However, not every client of the GUID code should be testing its correctness, especially in production code. That's what unit tests are supposed to do, so trade off the cost of missing a bug that your actual use would catch but the unit tests didn't, against the cost of second-guessing your libraries all the time.

Note also that GUIDs only work if everyone who is generating them co-operates. If your app generates the IDs on machines you countrol, then you might not need GUIDs anyway - a locally unique ID like an incrementing counter might do you fine. Obviously Mercurial can't use that, hence it uses hashes, but eventually SHA-1 will fall to an attack that generates collisions (or, even worse, pre-images), and they'll have to change.

If your app generates non-hash "GUIDs" on machines you don't control, like clients, then forget about accidental collisions, you're worried about deliberate collisions by malicious clients trying to DOS your server. Protecting yourself against that will probably protect you against accidents anyway.

Or should even these cases get handled gracefully?

The answer to this is probably "no". If you could handle colliding GUIDs gracefully, like a hashtable does, then why bother with GUIDs at all? The whole point of an "identifier" is that if two things have the same ID, then they're the same. If you don't want to treat them the same, just initially direct them into buckets like a hashtable does, then use a different scheme (like a hash).

score 4 · Accepted Answer

给定一个良好的 128 位散列，在给定随机输入的情况下与特定散列值发生冲突的可能性是：

1 / 2 ** 128大约等于3 * 10 ** -39。

可以使用用于解释生日问题的逻辑来计算给定样本没有发生碰撞的概率 ( p) 。n

p = (2 ** 128)! / (2 ** (128 * n) * (2 ** 128 - n)!)

其中!表示阶乘函数。随着样本数量的增加，我们可以绘制没有碰撞的概率：

随着样本数量的增加，随机 SHA-1 冲突的概率。http://img21.imageshack.us/img21/9186/sha1collision.png

在10**17和10**18散列之间，我们开始看到从 0.001% 到 0.14% 到 13% 的不平凡的冲突可能性10**19。因此，在一个拥有一百万、十亿个记录的系统中，依靠唯一性记录可能是不明智的（这样的系统是可以想象的），但在绝大多数系统中，发生冲突的可能性非常小，以至于您可以依赖散列的唯一性出于所有实际目的。

现在，抛开理论不谈，碰撞更有可能通过错误或有人攻击您的系统被引入您的系统，因此即使意外碰撞的可能性很小（即也就是说bug或恶意的概率比意外碰撞的概率要高得多）。

language-agnostic - 在应该是唯一的值上处理几乎不可能的冲突

2 回答 2

Related

Reference