0

我正在尝试破解 Java 中的哈希算法:

    public static int encode(String file) {
            int hash = 0;
            file = file.toUpperCase();

            for(int i = 0; i < file.length(); i++) {
                    hash = (hash * 61 + file.charAt(i)) - 32;
            }
            return hash;
    }

这是我的尝试:

public static String decode(int hash) {
            long realHash = (hash < 0 ? Integer.MAX_VALUE + Math.abs(hash) : hash);
            ByteBuffer buffer = ByteBuffer.allocate(50);

            while (realHash > 0) {
                    buffer.put((byte) ((realHash % 61) + 32));
                    realHash = (realHash - 32) / 61;
            }
            buffer.flip();
            return new String(buffer.array()).trim();
    }

我的解决方案似乎有严重的数据丢失,由于整数溢出,我认为我不能完全取消散列较长的文本数据。有什么建议么?

4

1 回答 1

8

It's not the integer overflow that's the problem. That's like driving over lava, having your car explode, and concluding that you didn't buy the right kind of gas.

The real problem is that you can't "crack" hash algorithms. There's one big reason why:

Information Theory

There's a term in information theory known as Shannon entropy. (Fair warning: that article is not easy to get through.) The quick version is that there's a minimum number of bits required to encode any given chunk of information.

This site has a calculator that claims to determine the amount of entropy (i.e. the minimum number of bits) required to losslessly encode a given text. I provided it with this chunk of artisanal filler text:

Meggings forage bitters tofu, Wes Anderson food truck craft beer iPhone. Single-origin coffee scenester narwhal, mumblecore mlkshk jean shorts chia trust fund art party pour-over you probably haven't heard of them bitters. Polaroid craft beer Intelligentsia, vinyl Marfa Brooklyn umami.

Assuming int is 32 bits on your system, you only have 32 bits of space to encode any given file. But that chunk above -- not too long compared to what I might have used, like War and Peace or the US Code -- requires 1472 bits at minimum to encode if you want to have any hope of reconstructing the text.

(Commentor templatetypedef points to Kolmogorov complexity (another good explanation of that concept), which is an even better way of representing the information content of a string and the futility of cracking a hash.)

So information-theoretically (and leaving aside cheating, like having a pre-filled compression dictionary), it's impossible to reconstruct those few (simple handcrafted locally-sourced) sentences from a 32-bit int. It's a basic law of the universe, unfortunately. Ain't gonna happen.

Practical example, from code

Another commentor mentions the Pigeonhole Principle -- the simple idea that, if you have N slots (in this case 2^32), you can't put more than N things in them without putting two or more things in the same slot.

Let's take your hash function:

public static int encode(String file) {
        int hash = 0;
        file = file.toUpperCase();

        for(int i = 0; i < file.length(); i++) {
                hash = (hash * 61 + file.charAt(i)) - 32;
        }
        return hash;
}

Specifically, this line:

file = file.toUpperCase();

I want to hash two files:

mary had a little lamb

Mary Had A Little Lamb

What will their hash values be? Think about it.

(Sidenote: Even with all that said, you are overflowing int. :) Modular arithmetic is your friend if you want to do stuff like that.)

于 2013-10-21T00:09:44.223 回答