java - Easy-to-remember hash (three words)

Question

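I want to create a hash that is easy to remember, such as three random words (like what3words). The idea is to hash a Java object so that the result is three random words.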

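Use case: I have objects with many fields, and I need to compress those fields into 24 characters (the size of the varchar primary key column in the database that stores these objects, which cannot be changed). The compressed value should also be easy to remember.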

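Initially, I decided to use three different hash functions (FNV1a64Hash, CRC32Hash and DJB2, which is named DBJ2 in my code) to create three pre-hashes and then use those values as indexes into a dictionary, but this produced a lot of collisions (Random Words tried: 10000000 No of collisions: 9272419). Note that my dictionary contains about 50k words.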

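Next, I decided to simply call hashCode() on the object, zero-pad the resulting int, and split it into three 5-digit chunks. Unfortunately this again produced a lot of collisions (Random Words tried: 10000000 No of collisions: 9999900). I think part of the problem is that the maximum value of an int is 2^31 - 1, which has only 10 digits, so the first index is always 00000.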
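The arithmetic behind that suspicion can be checked with a small sketch. The class and method names below are hypothetical and only reproduce the pad-and-split step for a single int; they are not part of the original code.

```java
import java.util.Locale;

// Hypothetical demo class; it only mirrors the pad-and-split step described above.
class PaddingDemo {
    // Zero-pads h to 15 characters and parses the three 5-character chunks.
    static int[] chunks(int h) {
        String s = String.format(Locale.ROOT, "%015d", h);
        return new int[] {
            Integer.parseInt(s.substring(0, 5)),
            Integer.parseInt(s.substring(5, 10)),
            Integer.parseInt(s.substring(10, 15))
        };
    }

    public static void main(String[] args) {
        // Integer.MAX_VALUE is 2147483647, which pads to "000002147483647".
        int[] c = chunks(Integer.MAX_VALUE);
        System.out.println(c[0] + " " + c[1] + " " + c[2]); // 0 21474 83647
    }
}
```

Because an int has at most 10 decimal digits, the first chunk is always 0 and the second chunk never exceeds 21474, so the three words cannot carry more information than the 32-bit hashCode() itself.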

I also tried universal hashing, but again got a large number of collisions (Random Words tried: 10000000 No of collisions: 9996436).

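Am I missing something obvious here, or does anyone know of a well-known algorithm that could help? Apologies in advance for the newbie question; this is my first encounter with hashing and I still have a lot to learn.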

I have pasted my code and test code below in case there is an obvious problem.

public static String generate3Words1(Object obj) {
    BigInteger input = BigInteger.valueOf(obj.hashCode());
    int index1 = indexFor(CRC32Hash(input.toByteArray()));
    int index2 = indexFor(FNV1a64Hash(input.toByteArray()));
    int index3 = indexFor(DBJ2(input.toByteArray()));
    return dictionary.get(index1) + "-" + dictionary.get(index2) + "-" + dictionary.get(index3);
}


public String generate3Words2(Object obj) {
    int h = (h = obj.hashCode()) ^ (h >>> 16);
    String i = String.format("%015d", h);
    String s = dictionary.get(indexFor(Integer.parseInt(i.substring(0, 5)))) + "-"
            + dictionary.get(indexFor(Integer.parseInt(i.substring(5, 10)))) + "-"
            + dictionary.get(indexFor(Integer.parseInt(i.substring(10, 15))));
    return s.length() > MAX_LEN ? s.substring(0, MAX_LEN) : s;
}

private static int indexFor(long h) {
    return (int) (h & (ThreeWordHash.dictionary.size() - 1));
}
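One thing worth checking in indexFor: masking with size - 1 reaches every index only when the size is a power of two. The sketch below assumes a dictionary of exactly 50,000 entries (the question only says about 50k) and uses hypothetical names; it compares the mask with Math.floorMod, which works for any positive size.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; the names are illustrative, not from the original code.
class IndexDemo {
    // Same masking as indexFor above, with the dictionary size passed in.
    static int maskIndex(long h, int size) {
        return (int) (h & (size - 1));
    }

    // Alternative: non-negative remainder, valid for any size > 0.
    static int modIndex(long h, int size) {
        return (int) Math.floorMod(h, (long) size);
    }

    // Counts the distinct indices produced for the inputs 0 .. 2^16 - 1.
    static int distinct(int size, boolean useMask) {
        Set<Integer> seen = new HashSet<>();
        for (long h = 0; h < (1L << 16); h++) {
            seen.add(useMask ? maskIndex(h, size) : modIndex(h, size));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(distinct(50_000, true));  // 512
        System.out.println(distinct(50_000, false)); // 50000
    }
}
```

With 50,000 entries, size - 1 is 49999, which has only 9 bits set, so the mask can produce just 2^9 = 512 different indices. If the real dictionary size is not a power of two, this alone would account for a large share of the collisions.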

private static long FNV1a64Hash(byte[] data) {
    long hash = 0xcbf29ce484222325L;
    for (byte datum : data) {
        hash ^= (datum & 0xff);
        hash *= 1099511628211L;
    }
    return hash;
}

private static long CRC32Hash(byte[] data) {
    CRC32.reset();
    CRC32.update(data);
    return CRC32.getValue();
}


private static long DBJ2(byte[] data) {
    long hash = 5381;
    for (byte datum : data) {
        hash = ((hash << 5) + hash) + datum;
    }
    return hash;
}

private static String universalHashing(Object data) {
    int[] hashCodes = new int[NO_OF_WORDS];
    int hashCodeSizeDiff = WORD_SIZE - (WORD_SIZE / 2);
    int hstart = data.hashCode();
    int bmax = 1 << hashCodeSizeDiff;
    for (int i = 0; i < NO_OF_WORDS; i++) {
        hashCodes[i] = (((hstart * (i * 2 + 1)) + RAND.nextInt(bmax)) >> hashCodeSizeDiff) & (ThreeWordHash.dictionary.size() - 1);
    }
    String s = ThreeWordHash.dictionary.get(hashCodes[0]) + " " + ThreeWordHash.dictionary.get(hashCodes[1]) + " " + ThreeWordHash.dictionary.get(hashCodes[2]);
    return s.length() > MAX_LEN ? s.substring(0, MAX_LEN) : s;
}
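The universalHashing method above calls RAND.nextInt(bmax) on every invocation. RAND is not shown, but if it is an ordinary java.util.Random, the same object could map to different words on different calls. In universal hashing the random coefficients are normally drawn once, when the hash function is chosen, and then held fixed. A minimal sketch of that idea, with hypothetical names and a multiply-add-shift style mix, might look like this:

```java
import java.util.Random;

// Hypothetical sketch; class and method names are illustrative.
class FixedUniversalHash {
    private final long[] a;  // odd multipliers, one per word position
    private final long[] b;  // additive offsets, one per word position
    private final int size;  // dictionary size (any positive value)

    FixedUniversalHash(int words, int size, long seed) {
        Random rnd = new Random(seed); // used only here, never while hashing
        this.a = new long[words];
        this.b = new long[words];
        this.size = size;
        for (int i = 0; i < words; i++) {
            a[i] = rnd.nextLong() | 1L; // force the multiplier to be odd
            b[i] = rnd.nextLong();
        }
    }

    // Deterministic: the same hashCode always yields the same indices.
    int[] indices(int hashCode) {
        int[] out = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            long mixed = a[i] * hashCode + b[i];   // wraps modulo 2^64
            // Take the high 32 bits, then reduce to the dictionary range.
            out[i] = (int) Math.floorMod(mixed >>> 32, (long) size);
        }
        return out;
    }

    public static void main(String[] args) {
        FixedUniversalHash f = new FixedUniversalHash(3, 50_000, 42L);
        int[] idx = f.indices("example".hashCode());
        System.out.println(idx[0] + " " + idx[1] + " " + idx[2]);
    }
}
```

The seed would have to stay the same for as long as stored keys need to be reproducible. This still starts from a 32-bit hashCode(), so it cannot produce more than 2^32 distinct word triples.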

Test code:

@Test
void generate3Words() {
    List<String> words = new ArrayList<>(TestDictionary.WORDS);
    words.addAll(TestDictionary.WORDS);

    Random random = new Random(1);
    HashSet<String> seen1 = new HashSet<>();
    HashSet<String> seen2 = new HashSet<>();
    
    int count = 0;
    int noOfIterations = 10000000;
    
    //NOTE test dict size approx 4k words
    for (int j = 0; j < noOfIterations; j++) {
        String randomWord =  new StringBuilder()
                .append(words.get(random.nextInt(TestDictionary.WORDS.size())))
                .append(words.get(random.nextInt(TestDictionary.WORDS.size())))
                .append(words.get(random.nextInt(TestDictionary.WORDS.size())))
                .append(words.get(random.nextInt(TestDictionary.WORDS.size())))
                .append(words.get(random.nextInt(TestDictionary.WORDS.size()))).toString();
      
        String res = ThreeWordHash.generate3Words(randomWord);
        
        if (seen1.contains(res) && !seen2.contains(randomWord)) {
            count++;
        }

        seen2.add(randomWord);
        seen1.add(res);
    }
    System.out.println("Random Words tried: " + seen2.size() + " No of collisions: " + count);
}
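All three attempts start from the 32-bit hashCode(), which limits how many distinct results are possible. One possible direction, offered only as a sketch, is to hash a canonical string form of the object's fields with a wide digest and split that value into three dictionary indices. The class name, the method name and the idea of passing a pre-built canonical string are assumptions for illustration; MessageDigest with SHA-256 and BigInteger are standard JDK classes.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the original ThreeWordHash class.
class DigestWords {
    // canonical: a stable string built from the object's fields (assumed to be provided by the caller).
    static String threeWords(String canonical, List<String> dictionary) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(canonical.getBytes(StandardCharsets.UTF_8));
            BigInteger n = new BigInteger(1, digest); // treat the digest as a non-negative number
            BigInteger size = BigInteger.valueOf(dictionary.size());
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 3; i++) {
                BigInteger[] qr = n.divideAndRemainder(size);
                if (i > 0) {
                    sb.append('-');
                }
                sb.append(dictionary.get(qr[1].intValue())); // remainder picks the word
                n = qr[0];                                   // quotient feeds the next word
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> dict = Arrays.asList("alpha", "bravo", "charlie", "delta", "echo");
        System.out.println(threeWords("id=42;name=example", dict));
    }
}
```

Even with well-mixed indices, three words from a 50,000-word dictionary give 50,000^3 = 1.25 x 10^14 combinations, so by the birthday bound collisions become likely once the number of keys approaches roughly ten million. Truncating the result to 24 characters, as the code above does with MAX_LEN, can merge distinct triples as well, so a primary key built this way would still need a collision check.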
