MD5 has known vulnerabilities at this point, but that may not be a problem for your application. It's still reasonably good for distinguishing piles of bits. If something comes up with no match, then you know you haven't already seen it, since the algorithm is deterministic. If something comes back as a match, you should actually compare it to the blob that it ostensibly matched before acting as if it's really a duplicate. MD5 is relatively fast, but if you can't afford full-text comparisons on hash collisions, you should probably use a stronger hash, like SHA-256.