Our application lets users generate and send a wide variety of documents. Some of these will always be unique, but a large percentage of them will be static. We store the files in a SQL Server 2008 database, using FILESTREAM for the actual data. I am looking for a way to detect when a file has already been stored so that I don't store a duplicate.

I am thinking of generating an MD5 hash of each file and using that hash as a key into the SQL database. What I am afraid of is the possibility of a collision occurring.

Some questions I have are:

1: What is the likelihood of getting a collision on the hash? Should I treat the unique key as a combination of file name, file size, and hash?

2: What would you store the resulting hash as in the database? Should we store it as a binary field?

1 Answer

This is one of the very common interview questions, so it should make for plenty of long discussion :).

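  1. Birthday paradox, so fairly high. But some data that can be obtained in constant time (such as the size, or the first/last X bytes) can make the "hash" longer and therefore bring the collision probability down to a more acceptable level. I would start with something that produces a longer hash (SHA-256?).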

  2. I would use a Base64 string of the SHA-256 hash plus any other useful bits (or any other indexable field; I believe binary is not).
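
To put rough numbers on the birthday-paradox point, here is a minimal Python sketch of the usual approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n files and a b-bit hash. It assumes hash outputs behave like uniformly random values and only covers accidental collisions, not deliberately crafted ones; the function name `collision_probability` is made up for this illustration.

```python
import math


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_items hashes collide.

    Assumes each hash is a uniformly random value of hash_bits bits
    (the standard birthday-bound approximation).
    """
    # Number of distinct pairs of items that could collide.
    pairs = num_items * (num_items - 1) / 2
    # 1 - exp(-pairs / 2**bits); expm1 keeps precision when the result is tiny.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


if __name__ == "__main__":
    # Compare a 128-bit hash (MD5-sized) with a 256-bit one (SHA-256-sized).
    for bits in (128, 256):
        print(bits, collision_probability(10**9, bits))
```

Under these assumptions the estimate depends only on the number of stored files and the hash length, so it is easy to rerun with your own expected file count before settling on a key format.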

Side note: I would not use the file name as part of the "hash", since it is not part of the binary data itself and can change independently.
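
As a sketch of what such a key could look like, the Python below combines the file size with a Base64-encoded SHA-256 digest of the content and deliberately leaves the file name out. The function name `content_key` and the `size:digest` layout are assumptions made for this example, not a prescribed format.

```python
import base64
import hashlib
import os


def content_key(path: str, chunk_size: int = 1 << 20) -> str:
    """Build a lookup key from a file's size and SHA-256 digest.

    The file name is intentionally ignored, so renaming a file
    does not change its key.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files are never fully loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    size = os.path.getsize(path)
    encoded = base64.b64encode(digest.digest()).decode("ascii")
    # Hypothetical layout: "<size in bytes>:<Base64 SHA-256>".
    return f"{size}:{encoded}"
```

Two files with different names but identical bytes produce the same key, which is the property needed for duplicate detection; the resulting string can be stored in an ordinary indexed character column.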

answered 2013-05-28T16:26:31.713