Is taking an MD5 sum still suitable for checking for file dupes? I know that it isn't secure, but does that really matter when trying to find file dupes?

Should I be using something in the SHA family instead?

What is best practice in this use case?

8 Answers

In this particular case, choice of algorithm probably isn't that significant. The key reasons for using SHA1 over MD5 all relate to creating cryptographically secure signatures.

MD5 should be perfectly acceptable for this task, as you probably don't need to worry about people maliciously crafting files to generate false duplicates.
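A minimal sketch of that approach in Python — the `file_md5` and `find_dupes` helpers and the chunk size are illustrative assumptions, not code from the answer:

```python
import hashlib

def file_md5(path, chunk_size=65536):
    """Hash a file in chunks so large files are never fully loaded into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_dupes(paths):
    """Group paths by MD5 digest; any group with more than one path holds duplicates."""
    by_hash = {}
    for path in paths:
        by_hash.setdefault(file_md5(path), []).append(path)
    return [group for group in by_hash.values() if len(group) > 1]
```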

answered 2010-01-03T02:12:13.187
If you care about performance, I think it's better to first check for matching file sizes, then use a fast hash function (CRC32 or MD5, which should be faster than SHA1), and for files that collide, check again with MD5, SHA1, or SHA256 (depending on how critical the task is).
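That two-stage idea (bucket by size first, hash only files that share a size) might look like this in Python; the function name and the choice of MD5 as the fast hash are assumptions for illustration:

```python
import hashlib
import os
from collections import defaultdict

def two_stage_dupes(paths):
    # Stage 1: bucket by file size -- a cheap stat() call weeds out most files.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    # Stage 2: hash only files that share a size with at least one other file.
    dupes = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        by_digest = defaultdict(list)
        for path in same_size:
            with open(path, "rb") as f:
                by_digest[hashlib.md5(f.read()).hexdigest()].append(path)
        dupes.extend(g for g in by_digest.values() if len(g) > 1)
    return dupes
```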

answered 2010-01-03T02:49:52.907
SHA1 is slightly better as a checksum than MD5. It is what Git uses.

answered 2010-01-03T02:11:21.640
MD5 has known vulnerabilities at this point, but that may not be a problem for your application. It's still reasonably good for distinguishing piles of bits. If something comes up with no match, then you know you haven't already seen it, since the algorithm is deterministic. If something comes back as a match, you should actually compare it to the blob that it ostensibly matched before acting as if it's really a duplicate. MD5 is relatively fast, but if you can't afford full-text comparisons on hash collisions, you should probably use a stronger hash, like SHA-256.
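A sketch of the verify-on-match step this answer describes, assuming a simple in-memory digest-to-path index (`seen` and `is_duplicate` are hypothetical names, not from the answer):

```python
import filecmp
import hashlib

seen = {}  # digest -> path of the first file seen with that digest

def is_duplicate(path):
    """Return True only when a hash match is confirmed byte-for-byte."""
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    if digest not in seen:
        seen[digest] = path  # no match: we definitely have not seen this content
        return False
    # A matching digest is only a candidate -- compare against the stored file.
    return filecmp.cmp(path, seen[digest], shallow=False)
```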

answered 2010-01-03T02:12:11.763
For the purpose described, there is no clearly preferable choice: either hash function will solve the problem. That said, MD5 will usually be slightly faster than SHA1.

An example in Python:

#!/usr/bin/env python

import hashlib, cProfile

def repeat(f, loops=10000000):
    def wrapper(): 
        for i in range(loops): f()
    return wrapper

@repeat
def test_md5():
    md5 = hashlib.md5(); md5.update(b"hello"); md5.hexdigest()  # bytes, for Python 3

@repeat 
def test_sha1():
    sha = hashlib.sha1(); sha.update(b"hello"); sha.hexdigest()  # bytes, for Python 3

cProfile.run('test_md5()')
cProfile.run('test_sha1()')

#
#         40000004 function calls in 59.841 CPU seconds
# 
# ....
#
#         40000004 function calls in 65.346 CPU seconds
# 
# ....
answered 2010-01-03T02:27:03.730
What you are talking about is a checksum, which is related to (but not the same as) a cryptographic hash.

Yes, MD5 or even CRC works fine as a checksum, as long as you are not worried about a malicious user deliberately crafting two different files with the same checksum. If that is a concern, use SHA1 or, better yet, a hash that is still cryptographically unbroken.
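The distinction is visible in Python's standard library: `zlib.crc32` is a plain 32-bit checksum, while `hashlib` holds the cryptographic hashes (the sample data here is arbitrary):

```python
import hashlib
import zlib

data = b"hello world"

crc = zlib.crc32(data) & 0xFFFFFFFF    # 32-bit checksum: fast, tiny, easy to forge
md5 = hashlib.md5(data).hexdigest()    # 128-bit digest
sha1 = hashlib.sha1(data).hexdigest()  # 160-bit digest

print(f"{crc:08x}", md5, sha1)
```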

answered 2010-01-03T02:33:36.850
While MD5 does have a few collisions, I've always used it for files and it's worked just fine.

answered 2010-01-03T02:12:30.907
We use MD5 at my work for exactly what you're considering. Works great. We only need to detect duplicate uploads on a per-customer basis, which reduces our exposure to the birthday problem, but MD5 would still be sufficient for us even if we had to detect duplicates across all uploads rather than per customer. If you can believe the internet, the probability p of a collision, given n samples and a hash size of b bits, is bounded by:

p <= n(n - 1) / (2 * 2^b)

A few years ago I ran this calculation for n = 10^9 and b = 128 and got p <= 1.469E-21. To put that in perspective, 10^9 files is one file per second for 32 years. So we don't compare files on a collision. If MD5 says the uploads are identical, they are identical.
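That bound is easy to recompute (the `collision_bound` helper name is mine, not from the answer):

```python
def collision_bound(n, b):
    """Upper bound on the probability of any collision among n samples of a b-bit hash."""
    return n * (n - 1) / (2 * 2**b)

p = collision_bound(10**9, 128)  # about 1.47e-21, matching the figure quoted above
print(p)
```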

answered 2010-01-03T06:17:20.110