The birthday paradox just states that if you randomly generate values in a space of n
, there is a rapid phase transition from no collisions to collisions when you store sqrt(n)
values - that's where the probability increases to over 50%.
In your example you have an alphabet of 26 + 10 characters, and 8 digits; so that's 36^8
or about 2.8 trillion possible keys; you can expect to have more than a 50% collision probability after about 1.6 million entries; that's not very good. There is a decent chance of collision at even a fraction of that.
As a comparison, let's say you generated a 160-bit random key for each receipt (2^160
possible values); then you would need to generate about 2^80
receipts (around 10^24
) to reach the same probability of collision. You could sell your product as a very large company for your whole life and probably still not see any. Another perspective is that your hard disk or computer will fail before you observe a collision.
The table in this article gives some concrete numbers for you. For example, with a 256-bit hash value and 10^31
values inserted, you would get a collision probability of 10^-15
. According to that article, that is around the uncorrectable error rate of your hard disk. That's probably the magnitude of what you should be aiming for with your receipts to avoid overwriting them. It's not to hard to make the values just a little bigger.
Of course, this depends on the fact that you seed your PRNG properly with random data; otherwise you could get the same key easily :)