Our application lets users generate and send a wide variety of documents. Some of these will always be unique, but a large percentage of them are static. We store the files in a SQL Server 2008 database, using FILESTREAM for the actual data. I am looking for a way to detect when a file has already been stored so that I don't store a duplicate.
I am thinking of generating an MD5 hash of each file's contents and using that hash as a key in the SQL database. What I am afraid of is the possibility of a collision, i.e. two different files producing the same hash.
Some questions I have are:
1: What is the likelihood of getting a collision on the hash? Should I make the unique key a combination of file name, file size, and hash?
2: What would you store the resulting hash as in the database? Should we store it in a binary field?
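To make the idea concrete, here is a rough sketch of the scheme I have in mind, written in Python purely for illustration. The names `file_digest`, `collision_probability`, and `DedupStore` are hypothetical, and the in-memory dict is only a stand-in for the database table (it is an assumption, not how we actually store anything). The probability function uses the standard birthday-bound approximation for accidental collisions, which assumes the hash output is uniformly distributed; it says nothing about deliberately crafted collisions.

```python
import hashlib
import math


def file_digest(data: bytes) -> bytes:
    """Return the raw 16-byte MD5 digest of the file contents."""
    return hashlib.md5(data).digest()


def collision_probability(n_files: int, hash_bits: int = 128) -> float:
    """Birthday-bound estimate: p ~= 1 - exp(-n(n-1) / 2^(bits+1)).

    Assumes uniformly distributed hashes (accidental collisions only).
    """
    exponent = -n_files * (n_files - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


class DedupStore:
    """Hypothetical stand-in for the table: maps digest -> file bytes."""

    def __init__(self):
        self._by_digest = {}

    def store(self, data: bytes):
        """Store data unless already present; return (digest, was_new)."""
        key = file_digest(data)
        existing = self._by_digest.get(key)
        if existing is None:
            self._by_digest[key] = data
            return key, True
        if existing == data:
            # Same hash and same bytes: a true duplicate, nothing to store.
            return key, False
        # Same hash but different bytes: a real collision.
        raise ValueError("hash collision between different files")


if __name__ == "__main__":
    store = DedupStore()
    key, was_new = store.store(b"quarterly report")
    print(key.hex(), was_new)
    print(store.store(b"quarterly report")[1])
    print(collision_probability(10**9))
```

In this sketch the digest is kept as 16 raw bytes, which is what a fixed-length binary column would hold, and a byte-for-byte comparison on a hash match is what separates a true duplicate from a collision.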