algorithm - 为所有字谜生成相同的唯一哈希码

Question

最近，我参加了一次采访，遇到了一个关于哈希冲突的好问题。

问题：给定一个字符串列表，一起打印出字谜。

示例 :

i/p : {act, god, animal, dog, cat}

o/p : act, cat, dog, god

我想创建哈希图并将单词作为键和值作为字谜列表

为避免冲突，我想为字谜生成唯一的哈希码，而不是排序并使用排序后的单词作为键。

我正在寻找除使用链接之外处理冲突的哈希算法。我希望算法为 act 和 cat 生成相同的哈希码...以便它将下一个单词添加到值列表中

任何人都可以提出一个好的算法吗？

score 28 · Accepted Answer

用排序后的字符串散列非常好，我可能会这样做，但它确实可能很慢而且很麻烦。这是另一个想法，不确定它是否有效 - 选择一组素数，如你所愿，与你的字符集大小相同，并构建一个从你的字符到那个的快速映射函数。然后对于给定的单词，将每个字符映射到匹配的素数，然后相乘。最后，使用结果进行散列。

这与 Heuster 的建议非常相似，只是碰撞较少（实际上，鉴于任何数的素数分解的唯一性，我相信不会有错误的碰撞）。

简单例如-

int primes[] = {2, 3, 5, 7, ...} // can be auto generated with a simple code

inline int prime_map(char c) {
    // check c is in legal char set bounds
    return primes[c - first_char];
}

...
char* word = get_next_word();
char* ptr = word;
int key = 1;
while (*ptr != NULL) {
    key *= prime_map(*ptr);
    ptr++;
}
hash[key].add_to_list(word);

[编辑]

关于唯一性的几句话 - 任何整数都有一个质数乘法的分解，所以给定哈希中的整数键，您实际上可以重建所有可能的字符串，这些字符串将散列到它，只有这些词。只需分解为素数 p1^n1*p2^n2*... 并将每个素数转换为匹配的字符。p1 的字符将出现 n1 次，依此类推。你不能得到任何你没有明确使用过的新素数，成为素数意味着你不能通过任何其他素数的乘法得到它。

这带来了另一个可能的改进——如果你可以构造字符串，你只需要标记你在填充散列时看到的排列。由于排列可以按字典顺序排序，因此您可以用一个数字替换每个排列。这将节省在散列中存储实际字符串的空间，但需要更多的计算，因此它不一定是一个好的设计选择。尽管如此，它还是使最初的面试问题复杂化了:)

score 6 · Accepted Answer

哈希函数：为每个字符分配主要数字。在计算哈希码时，获取分配给该字符的素数并乘以现有值。现在所有字谜产生相同的哈希值。

例如：a - 2，c - 3 t - 7

cat 的哈希值 = 3*2*7 = 42 act 的哈希值 = 2*3*7 = 42 打印所有具有相同哈希值的字符串（字谜将具有相同的哈希值）

score 3 · Accepted Answer

其他海报建议将字符转换为素数并将它们相乘。如果你对一个大素数取模，你会得到一个不会溢出的好散列函数。我针对大多数英语单词的 Unix 单词列表测试了以下 Ruby 代码，发现不是彼此字谜的单词之间没有哈希冲突。（在 MAC OS X 上，此文件位于：/usr/share/dict/words。）

我的 word_hash 函数采用每个字符 mod 32 的序数值。这将确保大写和小写字母具有相同的代码。我使用的大素数是 2^58 - 27。只要小于 2^64 / A，任何大素数都可以，其中 A 是我的字母大小。我使用 32 作为我的字母大小，所以这意味着我不能使用大于大约 2^59 - 1 的数字。因为 ruby 使用一位作为符号，另一位表示该值是数字还是对象，我比其他语言输了一点。

def word_hash(w)
  # 32 prime numbers so we can use x.ord % 32. Doing this, 'A' and 'a' get the same hash value, 'B' matches 'b', etc for all the upper and lower cased characters.
  # Punctuation gets assigned values that overlap the letters, but we don't care about that much.
  primes = [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103,107,109,113,127,131]
  # Use a large prime number as modulus. It must be small enough so that it will not overflow if multiplied by 32 (2^5). 2^64 / 2^5 equals 2^59, so we go a little lower.
  prime_modulus = (1 << 58) - 27
  h = w.chars.reduce(1) { |memo,letter| memo * primes[letter.ord % 32] % prime_modulus; }
end

words = (IO.readlines "/usr/share/dict/words").map{|word| word.downcase.chomp}.uniq
wordcount = words.size
anagramcount = words.map { |w| w.chars.sort.join }.uniq.count

whash = {}
inverse_hash = {}
words.each do |w|
  h = word_hash(w)
  whash[w] = h
  x = inverse_hash[h]
  if x && x.each_char.sort.join != w.each_char.sort.join
    puts "Collision between #{w} and #{x}"
  else
    inverse_hash[h] = w
  end
end
hashcount = whash.values.uniq.size
puts "Unique words (ignoring capitalization) = #{wordcount}. Unique anagrams = #{anagramcount}. Unique hash values = #{hashcount}."

score 3 · Accepted Answer

实用的小优化，我建议上面的哈希方法是：

将最少的素数分配给元音，然后分配最频繁出现的辅音。例如：e：2 a：3 i：5 o：7 u：11 t：13 等等......

此外，英语的平均字长为：~ 6

此外，前 26 个素数小于 100 [2,3,5,7, .. , 97]

因此，平均而言，您的哈希将产生大约 100^6 = 10^12 的值。

因此，如果您对大于 10 ^ 12 的模数取质数，那么碰撞的可能性就会非常小。

score 2 · Accepted Answer

上面的复杂性似乎很错位！您不需要素数或散列。这只是三个简单的操作：

将每个 OriginalWord映射到 (SortedWord, OriginalWord) 的元组。 示例：“cat”变成（“act”，“cat”）；“狗”变成（“dgo”，“狗”）。 这是对每个 OriginalWord 字符的简单排序。
按元组的第一个元素对元组进行排序。 示例：("dgo", "dog"), ("act, "cat") 排序为 ("act", "cat"), ("dgo", "dog")。 这是对整体的简单排序收藏。
遍历元组（按顺序），发出 OriginalWord。 示例： ("act", "cat"), ("dgo", "dog") 发出 "cat" "dog"。这是一个简单的迭代。

只需两次迭代和两种排序！

在 Scala 中，这正是一行代码：

val words = List("act", "animal", "dog", "cat", "elvis", "lead", "deal", "lives", "flea", "silent", "leaf", "listen")

words.map(w => (w.toList.sorted.mkString, w)).sorted.map(_._2)
# Returns: List(animal, act, cat, deal, lead, flea, leaf, dog, listen, silent, elvis, lives)

或者，正如原始问题所暗示的那样，您只需要计数 > 1 的情况，它只是多一点：

scala> words.map(w => (w.toList.sorted.mkString, w)).groupBy(_._1).filter({case (k,v) => v.size > 1}).mapValues(_.map(_._2)).values.toList.sortBy(_.head)
res64: List[List[String]] = List(List(act, cat), List(elvis, lives), List(flea, leaf), List(lead, deal), List(silent, listen))

score 0 · Accepted Answer

使用素数乘积的解决方案非常棒，这是一个 Java 实现，以防任何人都需要它。

class HashUtility {
    private int n;
    private Map<Character, Integer> primeMap;

    public HashUtility(int n) {
        this.n = n;
        this.primeMap = new HashMap<>();
        constructPrimeMap();
    }

    /**
     * Utility to check if the passed {@code number} is a prime.
     *
     * @param number The number which is checked to be prime.
     * @return {@link boolean} value representing the prime nature of the number.
     */
    private boolean isPrime(int number) {
        if (number <= 2)
            return number == 2;
        else
            return (number % 2) != 0
                    &&
                    IntStream.rangeClosed(3, (int) Math.sqrt(number))
                            .filter(n -> n % 2 != 0)
                            .noneMatch(n -> (number % n == 0));
    }

    /**
     * Maps all first {@code n} primes to the letters of the given language.
     */
    private void constructPrimeMap() {
        List<Integer> primes = IntStream.range(2, Integer.MAX_VALUE)
                .filter(this::isPrime)
                .limit(this.n)      //Limit the number of primes here
                .boxed()
                .collect(Collectors.toList());

        int curAlphabet = 0;
        for (int i : primes) {
            this.primeMap.put((char) ('a' + curAlphabet++), i);
        }
    }

    /**
     * We calculate the hashcode of a word by calculating the product of each character mapping prime. This works since
     * the product of 2 primes is unique from the products of any other primes.
     * <p>
     * Since the hashcode can be huge, we return it modulo a large prime.
     *
     * @param word The {@link String} to be hashed.
     * @return {@link int} representing the prime hashcode associated with the {@code word}
     */
    public int hashCode(String word) {
        long primeProduct = 1;
        long mod = 100000007;
        for (char currentCharacter : word.toCharArray()) {
            primeProduct *= this.primeMap.get(currentCharacter) % mod;
        }

        return (int) primeProduct;
    }
}

请让我知道是否/如何改进这一点。

score 0 · Accepted Answer

我们可以使用数组的二进制值表示。此代码片段假设所有字符都是小拉丁字符。

public int hashCode() {
    //TODO: so that each set of anagram generates same hashCode
    int sLen = s.length();
    int [] ref = new int[26];
    for(int i=0; i< sLen; i++) {
      ref[s.charAt(i) - 'a'] +=1;
    }
    int hashCode = 0;
    for(int i= 0; i < ref.length; i++) {
      hashCode += new Double(Math.pow(2, i)).intValue() * ref[i];
    }
    return hashCode;
  }

algorithm - 为所有字谜生成相同的唯一哈希码

7 回答 7

Related

Reference