java - A particular type of hash on string concatenation

Question

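I need a specialised hash function h(X,Y) in Java with the following properties.

1. X and Y are strings.
2. h(X,Y) = h(Y,X).
3. X and Y are arbitrary-length strings, and there is no length limit on the result of h(X,Y) either.
4. If X is not equal to A and Y is not equal to B, then h(X,Y) and h(Y,X) should not collide with h(A,B) = h(B,A).
5. h() does not need to be a secure hash function, unless that is necessary to meet the requirements above.
6. Fairly high performance, but this is an open-ended criterion.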

To my mind, requirements 2 and 4 look slightly contradictory, but perhaps I am worrying too much.

At the moment, this is what I am doing in Java:

public static BigInteger hashStringConcatenation(String str1, String str2) {
    BigInteger bA = BigInteger.ZERO;
    BigInteger bB = BigInteger.ZERO;
    for(int i=0; i<str1.length(); i++) {
        bA = bA.add(BigInteger.valueOf(127L).pow(i+1).multiply(BigInteger.valueOf(str1.codePointAt(i))));
    }
    for(int i=0; i<str2.length(); i++) {
        bB = bB.add(BigInteger.valueOf(127L).pow(i+1).multiply(BigInteger.valueOf(str2.codePointAt(i))));
    }
    return bA.multiply(bB);
}

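I think this is horrible, which is why I am looking for a better solution. Thanks.

I forgot to mention that on a 2.53GHz dual-core MacBook Pro with 8GB RAM, running Java 1.6 on OS X 10.7, the hash function takes about 270 microseconds for two 8-character (ASCII) strings. I suspect this would be higher as the string size grows, or if Unicode characters are used.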

score 3 · Accepted Answer

Why not just add their hashCodes together?
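A minimal sketch of that idea, for illustration only. The class and method names are hypothetical, and widening the sum to `long` is an assumption made here to avoid `int` overflow. Addition is commutative, so requirement 2 holds by construction, but requirement 4 does not: `"Aa"` and `"BB"` have the same `String.hashCode()`, so pairs built from them collide.

```java
// Illustrative sketch; SymmetricSumHash is a hypothetical name, not code from the thread.
class SymmetricSumHash {

    // Adds the two String hash codes. The operands are widened to long
    // so the sum of two 32-bit values cannot overflow.
    static long hash(String x, String y) {
        return (long) x.hashCode() + (long) y.hashCode();
    }

    public static void main(String[] args) {
        // Symmetry: the argument order does not matter.
        System.out.println(hash("foo", "bar") == hash("bar", "foo"));
        // Collision: "Aa" and "BB" share the same hashCode, so these two pairs collide.
        System.out.println(hash("Aa", "x") == hash("BB", "x"));
    }
}
```

Both comparisons evaluate to true, which shows the trade-off: the function is cheap and symmetric, but distinct pairs can easily map to the same value.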

answered 2012-07-31T13:37:33.800

score 1 · Accepted Answer

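How strict are you about requirement 4? If the answer is "not completely strict", then you could simply concatenate the two strings, putting the smaller one first (this would cause h('A', 'B') and h('AB', '') to collide).

If there is any character that you are sure will never appear in the string values, then you can use a single instance of it as a separator, which would fix the collision above.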
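The ordered concatenation with a separator might look like the sketch below. It makes two assumptions that are not in the answer: "smaller" is taken to mean lexicographically smaller (via `compareTo`), and `'\u0000'` is the character assumed never to occur in the inputs. `OrderedPairKey` and `key` are hypothetical names.

```java
// Illustrative sketch; names and the choice of separator are assumptions.
class OrderedPairKey {

    // Assumption: this character never occurs in either input string.
    static final char SEPARATOR = '\u0000';

    // Builds a key that is identical for (x, y) and (y, x) by always
    // placing the lexicographically smaller string first.
    static String key(String x, String y) {
        if (x.compareTo(y) <= 0) {
            return x + SEPARATOR + y;
        }
        return y + SEPARATOR + x;
    }

    public static void main(String[] args) {
        System.out.println(key("A", "B").equals(key("B", "A")));  // same key for both orders
        System.out.println(key("A", "B").equals(key("AB", "")));  // separator keeps these apart
    }
}
```

The result grows with the inputs, so this is closer to a canonical key than to a fixed-size hash. It can be fed to any ordinary hash function afterwards if a bounded value is needed.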

score 1 · Accepted Answer

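3) If X is not equal to A and Y is not equal to B, then h(X,Y) and h(Y,X) should not collide with h(A,B) = h(B,A).

I think this requirement rules out any hash function that produces numbers that are smaller (on average) than the original strings.

Any requirement of no collisions runs into the roadblock of the pigeonhole principle.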

score 1 · Accepted Answer

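From point 4 we can conclude that h(x,"") must never collide with h(y,"") unless x.equals(y) is true. So you cannot put a size limit on what h(x,y) produces, because it has to produce a unique result for each unique x. But there are infinitely many unique strings. I do not think this is a proper hash function.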
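To make the point concrete: if the result may be as long as the inputs, a collision-free function is really an encoding of the unordered pair rather than a hash. The sketch below is one such encoding, assuming a decimal length prefix is acceptable. `InjectivePairEncoding` and `encode` are hypothetical names, and ordering by `compareTo` is a choice made here, not something the thread prescribes.

```java
// Illustrative sketch; an injective, order-independent encoding of two strings.
class InjectivePairEncoding {

    // Orders the two strings, then writes "<length of first>:<first><second>".
    // The length prefix marks where the first string ends, so no character
    // has to be reserved as a separator.
    static String encode(String x, String y) {
        boolean xFirst = x.compareTo(y) <= 0;
        String first = xFirst ? x : y;
        String second = xFirst ? y : x;
        return first.length() + ":" + first + second;
    }

    public static void main(String[] args) {
        System.out.println(encode("A", "B"));   // 1:AB
        System.out.println(encode("B", "A"));   // 1:AB
        System.out.println(encode("AB", ""));   // 0:AB
    }
}
```

Because the output is never shorter than the combined input, this avoids the pigeonhole problem only by giving up compression, which is the trade-off this answer describes.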

score 1 · Accepted Answer

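Today I decided to add my own solution to this hash function problem. It has not been tested very well and I have not measured its performance, so please give me feedback in the comments. My solution is as follows: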

import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

public abstract class HashUtil {
    //determines that we want hash, that has size of 32 integers ( or 32*32 bits )
    private static final int hash_size = 32;

    //some constants that can be changed in sake of avoiding collisions
    private static final BigInteger INITIAL_HASH = BigInteger.valueOf(7);
    private static final BigInteger HASH_MULTIPLIER = BigInteger.valueOf(31);
    private static final BigInteger HASH_DIVIDER = BigInteger.valueOf(2).pow(32*hash_size);

    public static BigInteger computeHash(String arg){
        BigInteger hash = new BigInteger(INITIAL_HASH.toByteArray());
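        for (int i=0;i<arg.length()/hash_size+1;i++){
            int[] tmp = new int[hash_size];
            //copy the next block of up to hash_size code points; index by j so each one gets its own slot
            for(int j=0;j<Math.min(arg.length()-hash_size*i,hash_size);j++){
                tmp[j]=arg.codePointAt(i*hash_size+j);
            }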
            hash = hash.multiply(HASH_MULTIPLIER).add(new BigInteger(convert(tmp)).abs()).mod(HASH_DIVIDER);
        }
        //to reduce result space to something meaningful
        return hash;
    }

    public static BigInteger computeHash(String arg1,String arg2){
        //don't forget to reduce the result space here as well
        return computeHash(arg1).add(computeHash(arg2)).mod(HASH_DIVIDER);
    }

    private static byte[] convert(int[] arg){
        ByteBuffer byteBuffer = ByteBuffer.allocate(arg.length*4);
        IntBuffer intBuffer = byteBuffer.asIntBuffer();
        intBuffer.put(arg);
        return byteBuffer.array();
    }

    public static void main(String[] args){
        String firstString="dslkjfaklsjdkfajsldfjaldsjflaksjdfklajsdlfjaslfj",secondString="unejrng43hti9uhg9rhe3gh9rugh3u94htfeiuwho894rhgfu";
        System.out.println(computeHash(firstString,secondString).equals(computeHash(secondString,firstString)));
    }

}

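I think my solution should not produce any collisions for single strings shorter than 32 characters (more precisely, for single strings shorter than the value of the hash_size variable). It is also not very easy to find collisions (I think). To tune the hash collision probability for your particular task, you can try other prime numbers instead of 7 and 31 in the INITIAL_HASH and HASH_MULTIPLIER variables. What do you think about it? Is it good enough for you?

P.S. I think it would be better if you tried larger prime numbers.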

score 0 · Accepted Answer

This builds on String#hashCode, which is not a perfect hash function, so it does not satisfy condition 4.

public static long hashStringConcatenation(String str1, String str2) {
    int h1 = str1.hashCode();
    int h2 = str2.hashCode();

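    // Pack the smaller hash code into the high 32 bits and the larger one into the low 32 bits.
    // Combine with | (not &) and mask the low half so sign extension cannot overwrite the high half.
    if ( h1 < h2 )
    {
        return ((long)h1)<<32 | (h2 & 0xFFFFFFFFL);
    }
    else
    {
        return ((long)h2)<<32 | (h1 & 0xFFFFFFFFL);
    }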
}

score 0 · Accepted Answer

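OK, @gkuzmin's comment made me think about why I was using powers of 127. So, here is a slightly simpler version of the code. The changes are as follows:

I no longer use powers of 127. Instead, I concatenate the codePointAt numbers as strings, convert the result for each input string into a BigInteger, and then add the two BigIntegers.
To compact the answer, I take the final result mod 2^1024.

The speed is no better (perhaps worse!), but I do not think the way I measured the speed is right, because it probably also measured the time taken by the function call.

Here is the modified code. Does it satisfy all the conditions, with 4 holding except in the unfortunate case where duplicates occur over the 2^1024 result space?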

public static BigInteger hashStringConcatenation(String str1, String str2) {
    if(str1==null || str1.isEmpty() || str2 == null || str2.isEmpty()) {
        return null;
    }
    BigInteger bA, bB;
    String codeA = "", codeB = "";
    for(int i=0; i<str1.length(); i++) {
        codeA += str1.codePointAt(i);
    }
    for(int i=0; i<str2.length(); i++) {
        codeB += str2.codePointAt(i);
    }
    bA = new BigInteger(codeA);
    bB = new BigInteger(codeB);
    return bA.add(bB).mod(BigInteger.valueOf(2).pow(1024));
}

score 0 · Accepted Answer

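I decided to add another answer because @Anirban Basu has proposed another solution. I do not know how to link to his post, so if someone knows how to do that, please correct me.

Anirban's solution looks like this: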

public static BigInteger hashStringConcatenation(String str1, String str2) {
    if(str1==null || str1.isEmpty() || str2 == null || str2.isEmpty()) {
        return null;
    }
    BigInteger bA, bB;
    String codeA = "", codeB = "";
    for(int i=0; i<str1.length(); i++) {
        codeA += str1.codePointAt(i);
    }
    for(int i=0; i<str2.length(); i++) {
        codeB += str2.codePointAt(i);
    }
    bA = new BigInteger(codeA);
    bB = new BigInteger(codeB);
    return bA.add(bB).mod(BigInteger.valueOf(2).pow(1024));
}

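Your new solution now looks like a hash function, but it still has some problems. I suggest you think about the following:

Maybe it would be better to throw a NullPointerException or an IllegalArgumentException when null is passed as a function argument? Are you sure that you do not want to compute hashes for empty strings?
To concatenate a large number of strings, it is better to use StringBuffer instead of the + operator. Using this class will have a huge positive impact on your code's performance.
Your hash function is not very secure: it is really easy to compute strings that produce a collision.

You can try this code, which demonstrates a collision in your hash function.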

public static void main(String[] args){
    String firstString=new StringBuffer().append((char)11).append((char)111).toString();
    String secondString=new StringBuffer().append((char)111).append((char)11).toString();

    BigInteger hash1 = hashStringConcatenation(firstString,"arbitrary_string");
    BigInteger hash2 = hashStringConcatenation(secondString,"arbitrary_string");
    System.out.println("Is hash equal: "+hash1.equals(hash2));
    System.out.println("Conflicted values: {"+firstString+"},{"+secondString+"}");
}

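So it is really easy to break your hash function. Moreover, it is good that it has a 2^1024 result space, but with your implementation many real-life collisions occur between very similar and simple strings.

P.S. I think you should read about hash algorithms that have already been developed and about hash functions that failed in real life (such as the hash function of the Java String class, which in the past used at most 16 characters of the string to compute the hash), and check your solution against your requirements and against real-life data. At the very least you can try to find a hash collision by hand, and if you succeed, your solution most likely has some problems.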
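As one example of leaning on an established algorithm, the sketch below feeds an order-independent encoding of the two strings to SHA-256 through the standard `java.security.MessageDigest` API. This is a swapped-in technique, not something proposed in the thread. `DigestPairHash` is a hypothetical name, the length-prefixed encoding is an assumption, and `StandardCharsets` needs Java 7 or later. Collisions remain possible in principle because the output is fixed at 256 bits. Inputs are assumed to be well-formed UTF-16.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch; not the approach of any answer in this thread.
class DigestPairHash {

    static BigInteger hash(String x, String y) {
        if (x == null || y == null) {
            throw new NullPointerException("arguments must not be null");
        }
        // Canonical order makes the result independent of argument order.
        boolean xFirst = x.compareTo(y) <= 0;
        String first = xFirst ? x : y;
        String second = xFirst ? y : x;
        // The length prefix keeps ("A", "B") distinct from ("AB", "").
        String canonical = first.length() + ":" + first + second;
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(canonical.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest); // interpret the 32 bytes as a non-negative number
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hash("A", "B").equals(hash("B", "A")));
        System.out.println(hash("A", "B").equals(hash("AB", "")));
    }
}
```

Unlike the digit-concatenation schemes, empty strings are handled here without a special case, and null arguments fail fast with an exception.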

score 0 · Accepted Answer

Here is my code, changed according to @gkuzmin's suggestions:

public static BigInteger hashStringConcatenation(String str1, String str2) {
    BigInteger bA = BigInteger.ZERO, bB = BigInteger.ZERO;
    StringBuffer codeA = new StringBuffer(), codeB = new StringBuffer();
    for(int i=0; i<str1.length(); i++) {
        codeA.append(str1.codePointAt(i));
    }
    for(int i=0; i<str2.length(); i++) {
        codeB.append(str2.codePointAt(i));
    }
    bA = new BigInteger(codeA.toString());
    bB = new BigInteger(codeB.toString());
    return bA.multiply(bB).mod(BigInteger.valueOf(2).pow(1024));
}

Note that in the result I now multiply bA by bB instead of adding them.

I also added the test function suggested by @gkuzmin:

public static void breakTest2() {
    String firstString=new StringBuffer().append((char)11).append((char)111).toString();
    String secondString=new StringBuffer().append((char)111).append((char)11).toString();
    BigInteger hash1 = hashStringConcatenation(firstString,"arbitrary_string");
    BigInteger hash2 = hashStringConcatenation(secondString,"arbitrary_string");
    System.out.println("Is hash equal: "+hash1.equals(hash2));
    System.out.println("Conflicted values: {"+firstString+"},{"+secondString+"}");
}

And another test, using strings that contain only numeric values:

public static void breakTest1() {
    Hashtable<String,String> seenTable = new Hashtable<String,String>();
    for (int i=0; i<100; i++) {
        for(int j=i+1; j<100; j++) {
            String hash = hashStringConcatenation(""+i, ""+j).toString();
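            // note: Hashtable.contains(Object) looks up values, not keys; containsKey(hash) may be what was intended here
            if(seenTable.contains(hash)) {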
                System.out.println("Duplication for " + seenTable.get(hash) + " with " + i + "-" + j);
            }
            else {
                seenTable.put(hash, i+"-"+j);
            }
        }
    }
}

The code runs. Of course, this is not an exhaustive check, but the breakTest1() function did not show any problems. @gkuzmin's function prints the following:

Is hash equal: true
Conflicted values: {                    o},{o                         }

Why do these two strings produce the same hash? Because in both cases they effectively use the string '11111arbitrary_string'. This is a problem.
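One caveat about the numeric test: `Hashtable.contains` checks values rather than keys, so a lookup by hash can never match there. The sketch below repeats the same loop with a key lookup, under the assumption that a key lookup was the intent. `KeyLookupBreakTest`, `digits` and `findDuplicates` are hypothetical names. The hash itself follows the multiply-based scheme above, with `StringBuilder` in place of `StringBuffer`.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch; re-runs the numeric test with a lookup by key.
class KeyLookupBreakTest {

    // Decimal code points concatenated into one number, as in the code above.
    static BigInteger digits(String s) {
        StringBuilder code = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            code.append(s.codePointAt(i));
        }
        return new BigInteger(code.toString());
    }

    static BigInteger hash(String s1, String s2) {
        return digits(s1).multiply(digits(s2)).mod(BigInteger.valueOf(2).pow(1024));
    }

    // Returns one entry per pair whose hash was already produced by an earlier pair.
    static List<String> findDuplicates() {
        Map<String, String> seen = new HashMap<String, String>();
        List<String> duplicates = new ArrayList<String>();
        for (int i = 0; i < 100; i++) {
            for (int j = i + 1; j < 100; j++) {
                String h = hash("" + i, "" + j).toString();
                String earlier = seen.get(h); // lookup by key
                if (earlier != null) {
                    duplicates.add(earlier + " with " + i + "-" + j);
                } else {
                    seen.put(h, i + "-" + j);
                }
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        for (String d : findDuplicates()) {
            System.out.println("Duplication for " + d);
        }
    }
}
```

With a key lookup the test does report duplicates. For instance, the pairs 1-22 and 2-11 collide because 49 * 5050 and 50 * 4949 are both 247450.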

score 0 · Accepted Answer

How about this slightly modified function?

public static BigInteger hashStringConcatenation(String str1, String str2) {
    BigInteger bA = BigInteger.ZERO, bB = BigInteger.ZERO;
    StringBuffer codeA = new StringBuffer(), codeB = new StringBuffer();
    for(int i=0; i<str1.length(); i++) {
        codeA.append(str1.codePointAt(i)).append("0");
    }
    for(int i=0; i<str2.length(); i++) {
        codeB.append(str2.codePointAt(i)).append("0");
    }
    bA = new BigInteger(codeA.toString());
    bB = new BigInteger(codeB.toString());
    return bA.multiply(bB).mod(BigInteger.valueOf(2).pow(1024));
}

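Here, we append a separator "0" after each character code, so the character combinations 11 111 and 111 11 no longer confuse the function, because the concatenations now produce 1101110 and 1110110. However, it still does not break requirement 2 of the original question.

So does this now solve the problem, albeit within the 2^1024 range?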
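A quick check suggests the "0" separator is still ambiguous, because 0 is also a digit that occurs inside the character codes themselves. The sketch below reproduces the function above (with `StringBuilder` in place of `StringBuffer`) and compares two inputs. `SeparatorAmbiguity` and `digits` are hypothetical names. The character `'d'` has code 100 and encodes as "1000", which is also what the two characters with codes 1 and 0 encode to ("10" followed by "00").

```java
import java.math.BigInteger;

// Illustrative sketch; shows two collisions that survive the "0" separator.
class SeparatorAmbiguity {

    // Each code point in decimal, followed by the separator digit "0".
    static BigInteger digits(String s) {
        StringBuilder code = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            code.append(s.codePointAt(i)).append("0");
        }
        return new BigInteger(code.toString());
    }

    static BigInteger hash(String s1, String s2) {
        return digits(s1).multiply(digits(s2)).mod(BigInteger.valueOf(2).pow(1024));
    }

    public static void main(String[] args) {
        // "d" -> "1000"; "\u0001\u0000" -> "10" + "00" = "1000"
        System.out.println(hash("d", "x").equals(hash("\u0001\u0000", "x")));
        // 490 * 500500 and 500 * 490490 are the same product
        System.out.println(hash("1", "22").equals(hash("2", "11")));
    }
}
```

Both comparisons are true, so requirement 4 is still not met. The first collision comes from the separator, the second from multiplying the two numbers. A length-prefixed or fixed-width encoding of each code point would remove the first kind of ambiguity, though not the second.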
