其他海报建议将字符转换为素数并将它们相乘。如果你对一个大素数取模,你会得到一个不会溢出的好散列函数。我针对大多数英语单词的 Unix 单词列表测试了以下 Ruby 代码,发现不是彼此字谜的单词之间没有哈希冲突。(在 MAC OS X 上,此文件位于:/usr/share/dict/words。)
我的 word_hash 函数采用每个字符 mod 32 的序数值。这将确保大写和小写字母具有相同的代码。我使用的大素数是 2^58 - 27。只要小于 2^64 / A,任何大素数都可以,其中 A 是我的字母大小。我使用 32 作为我的字母大小,所以这意味着我不能使用大于大约 2^59 - 1 的数字。因为 ruby 使用一位作为符号,另一位表示该值是数字还是对象,我比其他语言输了一点。
def word_hash(w)
# 32 prime numbers so we can use x.ord % 32. Doing this, 'A' and 'a' get the same hash value, 'B' matches 'b', etc for all the upper and lower cased characters.
# Punctuation gets assigned values that overlap the letters, but we don't care about that much.
primes = [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103,107,109,113,127,131]
# Use a large prime number as modulus. It must be small enough so that it will not overflow if multiplied by 32 (2^5). 2^64 / 2^5 equals 2^59, so we go a little lower.
prime_modulus = (1 << 58) - 27
h = w.chars.reduce(1) { |memo,letter| memo * primes[letter.ord % 32] % prime_modulus; }
end
words = (IO.readlines "/usr/share/dict/words").map{|word| word.downcase.chomp}.uniq
wordcount = words.size
anagramcount = words.map { |w| w.chars.sort.join }.uniq.count
whash = {}
inverse_hash = {}
words.each do |w|
h = word_hash(w)
whash[w] = h
x = inverse_hash[h]
if x && x.each_char.sort.join != w.each_char.sort.join
puts "Collision between #{w} and #{x}"
else
inverse_hash[h] = w
end
end
hashcount = whash.values.uniq.size
puts "Unique words (ignoring capitalization) = #{wordcount}. Unique anagrams = #{anagramcount}. Unique hash values = #{hashcount}."