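I want to create a hash that is easy to remember, for example 3 random words (like what3words). My idea is to hash a Java object so that the result is three random words.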
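Use case: I have objects with many fields, and I need to compress those fields into 24 characters (the size of the varchar primary key column in the database that stores these objects, which cannot be changed). The compressed value should also be easy to remember.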
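Initially, I decided to use 3 different hash functions (namely FNV1a64Hash, CRC32Hash and DJB2) to create 3 pre-hashes and then use those values as indexes into a dictionary, but this led to a lot of collisions (Random Words tried: 10000000 No of collisions: 9272419). Note that my dictionary has about 50k words.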
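As a rough sanity check on these numbers, the sketch below compares the size of the three-word output space with the number of values a 32-bit hashCode() can take, and applies the birthday approximation n(n-1)/(2M) for the expected number of colliding pairs among n uniformly distributed keys. The class and method names are illustrative and not part of the code in this post, and the estimate assumes an ideal, uniform hash.

```java
class ThreeWordCapacity {
    // Number of distinct three-word combinations for a dictionary of the given size.
    static long combinations(long dictionarySize) {
        return dictionarySize * dictionarySize * dictionarySize;
    }

    // Birthday approximation: expected number of colliding pairs when n keys
    // are spread uniformly over a space of the given size.
    static double expectedCollidingPairs(long n, double space) {
        return (double) n * (n - 1) / (2.0 * space);
    }

    public static void main(String[] args) {
        long n = 10_000_000L;
        long threeWordSpace = combinations(50_000);   // 50,000^3
        double intSpace = Math.pow(2, 32);            // distinct 32-bit hashCode() values

        System.out.println(threeWordSpace);                             // 125000000000000
        System.out.println(expectedCollidingPairs(n, threeWordSpace));  // about 0.4
        System.out.println(expectedCollidingPairs(n, intSpace));        // roughly 11,600
    }
}
```

Under these assumptions, even a pipeline limited to 32 bits of input would be expected to produce collisions in the thousands rather than the millions for 10,000,000 keys, which suggests looking at how the hashes are mapped to indexes.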
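Next, I decided to simply call hashCode() on the object, zero-pad the resulting int, and split it into 3 chunks of 5 digits each. Unfortunately this again produced a lot of collisions (Random Words tried: 10000000 No of collisions: 9999900). I think this partly comes down to the maximum int value being 2^31 - 1, which has only 10 digits, so the first index is always 00000.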
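The padding effect can be reproduced in isolation. The sketch below (a hypothetical helper, not the method from this post) formats an int with "%015d" and splits it into three 5-digit chunks, the same way the second approach does before indexing into the dictionary.

```java
import java.util.Arrays;

class PaddedSplitDemo {
    // Zero-pads h to 15 characters and parses three 5-character chunks.
    static int[] chunks(int h) {
        String padded = String.format("%015d", h);
        return new int[] {
            Integer.parseInt(padded.substring(0, 5)),
            Integer.parseInt(padded.substring(5, 10)),
            Integer.parseInt(padded.substring(10, 15))
        };
    }

    public static void main(String[] args) {
        // Largest int: "000002147483647", so the first chunk is 0.
        System.out.println(Arrays.toString(chunks(Integer.MAX_VALUE))); // [0, 21474, 83647]
        // Negative values keep the sign in the first chunk: "-00000000000001".
        System.out.println(Arrays.toString(chunks(-1)));                // [0, 0, 1]
    }
}
```

Since an int never has more than 10 decimal digits, the first chunk is 0 for every input, and the second chunk never exceeds 21474, so only a small part of the dictionary is reachable for the first two words.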
I also tried universal hashing, but again I got quite a lot of collisions (Random Words tried: 10000000 No of collisions: 9996436).
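I'm wondering whether I'm missing something obvious here, or whether anyone knows of a well-known algorithm that could help? Apologies in advance for the newbie question; this is my first encounter with hashing and I still have a lot to learn.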
I've pasted my code and test code below in case there is an obvious problem.
// Approach 1: three different hashes of the same 32-bit hashCode(),
// each used as a dictionary index.
public static String generate3Words1(Object obj) {
    BigInteger input = BigInteger.valueOf(obj.hashCode());
    byte[] bytes = input.toByteArray();
    int index1 = indexFor(CRC32Hash(bytes));
    int index2 = indexFor(FNV1a64Hash(bytes));
    int index3 = indexFor(DBJ2(bytes));
    return dictionary.get(index1) + "-" + dictionary.get(index2) + "-" + dictionary.get(index3);
}
// Approach 2: zero-pad the hashCode() to 15 digits and split it into three 5-digit chunks.
public String generate3Words2(Object obj) {
    int h = obj.hashCode();
    h ^= (h >>> 16);
    String padded = String.format("%015d", h);
    int index1 = indexFor(Integer.parseInt(padded.substring(0, 5)));
    int index2 = indexFor(Integer.parseInt(padded.substring(5, 10)));
    int index3 = indexFor(Integer.parseInt(padded.substring(10, 15)));
    String s = dictionary.get(index1) + "-" + dictionary.get(index2) + "-" + dictionary.get(index3);
    return s.length() > MAX_LEN ? s.substring(0, MAX_LEN) : s;
}
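// Maps a hash to a dictionary index by masking with size - 1.
// NOTE: this mask only reaches every index when the size is a power of two.
private static int indexFor(long h) {
    return (int) (h & (ThreeWordHash.dictionary.size() - 1));
}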
// 64-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime.
private static long FNV1a64Hash(byte[] data) {
    long hash = 0xcbf29ce484222325L;
    for (byte datum : data) {
        hash ^= (datum & 0xff);
        hash *= 1099511628211L;
    }
    return hash;
}
// CRC32 is a shared checksum instance, presumably java.util.zip.CRC32 (declaration not shown).
private static long CRC32Hash(byte[] data) {
    CRC32.reset();
    CRC32.update(data);
    return CRC32.getValue();
}
// DJB2: hash = hash * 33 + byte, starting from 5381.
private static long DBJ2(byte[] data) {
    long hash = 5381;
    for (byte datum : data) {
        hash = ((hash << 5) + hash) + datum;
    }
    return hash;
}
// Approach 3: universal-hashing style mix of hashCode(), one index per word.
// NOTE: RAND.nextInt(bmax) is drawn on every call, so if RAND is a java.util.Random
// the same input can map to different words on different calls.
private static String universalHashing(Object data) {
    int[] hashCodes = new int[NO_OF_WORDS];
    int hashCodeSizeDiff = WORD_SIZE - (WORD_SIZE / 2);
    int hstart = data.hashCode();
    int bmax = 1 << hashCodeSizeDiff;
    for (int i = 0; i < NO_OF_WORDS; i++) {
        hashCodes[i] = (((hstart * (i * 2 + 1)) + RAND.nextInt(bmax)) >> hashCodeSizeDiff)
                & (ThreeWordHash.dictionary.size() - 1);
    }
    String s = ThreeWordHash.dictionary.get(hashCodes[0]) + " "
            + ThreeWordHash.dictionary.get(hashCodes[1]) + " "
            + ThreeWordHash.dictionary.get(hashCodes[2]);
    return s.length() > MAX_LEN ? s.substring(0, MAX_LEN) : s;
}
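A side note on the `h & (size - 1)` mask used by indexFor and universalHashing above: it behaves like a modulo only when the dictionary size is a power of two. The sketch below counts how many distinct indexes each mapping can reach for a 50,000-entry dictionary. The class and helper names are illustrative, and Math.floorMod is shown as one possible alternative, not as the fix for the collision counts reported here.

```java
import java.util.HashSet;
import java.util.Set;

class IndexMaskDemo {
    // Index computation as in the post: bitwise AND with size - 1.
    static int maskIndex(long h, int size) {
        return (int) (h & (size - 1));
    }

    // Alternative: non-negative remainder, which works for any positive size.
    static int modIndex(long h, int size) {
        return (int) Math.floorMod(h, (long) size);
    }

    // Counts the distinct indexes produced for h = 0 .. samples - 1.
    static int distinctMaskIndexes(int size, long samples) {
        Set<Integer> seen = new HashSet<>();
        for (long h = 0; h < samples; h++) {
            seen.add(maskIndex(h, size));
        }
        return seen.size();
    }

    static int distinctModIndexes(int size, long samples) {
        Set<Integer> seen = new HashSet<>();
        for (long h = 0; h < samples; h++) {
            seen.add(modIndex(h, size));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int size = 50_000;
        // 49,999 has only 9 bits set, so the mask can produce 2^9 values.
        System.out.println(distinctMaskIndexes(size, 1_000_000)); // 512
        System.out.println(distinctModIndexes(size, 1_000_000));  // 50000
    }
}
```

With the mask, a 50,000-word dictionary would contribute only 512 usable words per position, or 512^3 = 134,217,728 three-word combinations in total.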
Test code:
@Test
void generate3Words() {
    List<String> words = new ArrayList<>(TestDictionary.WORDS);
    words.addAll(TestDictionary.WORDS);
    Random random = new Random(1);
    HashSet<String> seen1 = new HashSet<>();
    HashSet<String> seen2 = new HashSet<>();
    int count = 0;
    int noOfIterations = 10000000;
    //NOTE test dict size approx 4k words
    for (int j = 0; j < noOfIterations; j++) {
        String randomWord = new StringBuilder()
                .append(words.get(random.nextInt(TestDictionary.WORDS.size())))
                .append(words.get(random.nextInt(TestDictionary.WORDS.size())))
                .append(words.get(random.nextInt(TestDictionary.WORDS.size())))
                .append(words.get(random.nextInt(TestDictionary.WORDS.size())))
                .append(words.get(random.nextInt(TestDictionary.WORDS.size()))).toString();
        String res = ThreeWordHash.generate3Words(randomWord);
        if (seen1.contains(res) && !seen2.contains(randomWord)) {
            count++;
        }
        seen2.add(randomWord);
        seen1.add(res);
    }
    System.out.println("Random Words tried: " + seen2.size() + " No of collisions: " + count);
}
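For comparison, one common way to turn a hash into a fixed number of words is to take a wide digest, reduce it modulo N^3 (N being the dictionary size), and read the result as three base-N digits. The sketch below is only an illustration under several assumptions: the object's fields are first serialized into a canonical string (canonicalFields is a hypothetical input), SHA-256 from java.security.MessageDigest stands in for the hash, and the dictionary contains no hyphens and only words of at most 7 characters so that three words plus two separators fit in 24 characters without truncation.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

class ThreeWordSketch {
    // Hashes the canonical field string and decomposes the digest, reduced
    // modulo N^3, into three base-N digits that index the dictionary.
    static String threeWords(String canonicalFields, List<String> dictionary) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(canonicalFields.getBytes(StandardCharsets.UTF_8));
            BigInteger n = BigInteger.valueOf(dictionary.size());
            // Interpret the digest as a non-negative integer in [0, N^3).
            BigInteger value = new BigInteger(1, digest).mod(n.pow(3));
            BigInteger[] low = value.divideAndRemainder(n);    // low[1]: last digit
            BigInteger[] mid = low[0].divideAndRemainder(n);   // mid[1]: middle, mid[0]: first
            return dictionary.get(mid[0].intValue()) + "-"
                    + dictionary.get(mid[1].intValue()) + "-"
                    + dictionary.get(low[1].intValue());
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Tiny illustrative dictionary; a real one would hold the ~50k words.
        List<String> dictionary = Arrays.asList("alpha", "bravo", "charlie", "delta", "echo");
        System.out.println(threeWords("id=42|name=example", dictionary));
    }
}
```

The same input always yields the same three words, and every index from 0 to N - 1 is reachable in each position. Collisions remain possible, as with any mapping into a finite space, so a uniqueness check on the primary key would still be needed.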