java - Malformed binary serialization of HashMap

Question

I wrote some code to serialize a HashMap<String,Double> by iterating over the entries and serializing each of them instead of using ObjectOutputStream.writeObject(). The reason is just efficiency: the resulting file is much smaller and it is much faster to write and read (e.g. 23 MB in 0.6 seconds vs. 29 MB in 9.9 seconds).

This is what I did to serialize:

ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream("test.bin"));
oos.writeInt(map.size()); // write size of the map
for (Map.Entry<String, Double> entry : map.entrySet()) { // iterate entries
    System.out.println("writing ("+ entry.getKey() +","+ entry.getValue() +")");
    byte[] bytes = entry.getKey().getBytes();
    oos.writeInt(bytes.length); // length of key string
    oos.write(bytes); // key string bytes
    oos.writeDouble(entry.getValue()); // value
}
oos.close();

As you can see, I get the byte array for each key String, serialize its length and then the array itself. This is what I did to deserialize:

ObjectInputStream ois = new ObjectInputStream(new FileInputStream("test.bin"));
int size = ois.readInt(); // read size of the map
HashMap<String, Double> newMap = new HashMap<>(size);
for (int i = 0; i < size; i++) { // iterate entries
    int length = ois.readInt(); // length of key string
    byte[] bytes = new byte[length];
    ois.read(bytes); // key string bytes
    String key = new String(bytes);
    double value = ois.readDouble(); // value
    newMap.put(key, value);
    System.out.println("read ("+ key +","+ value +")");
}

The problem is that at some point the key is not serialized correctly. I've been debugging to the point where I could see that ois.read(bytes) read 8 bytes instead of 16 as it was supposed to, so the key String was not properly formed and the double value was read using the last 8 bytes from the key that were not read yet. In the end, Exceptions everywhere.

Using the sample data below, the output will be like this at some point:

read (2010-00-056.html,12154.250518054876)
read (2010-00-        ,1.4007397428546247E-76)
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at ti.Test.main(Test.java:82)

The problem can be seen in the serialized file (it should read 2010-00-008.html):

[Image: hex view of the serialized file, with two extra bytes inside the key]

Two bytes are added in the middle of the key String. See MxyL's answer for more on this. So it all boils down to: why are those two bytes added, and why does readFully work?

Why isn't the String properly (de)serialized? Might it be some kind of padding to a fixed block size or something like that? Is there a better way to manually serialize a String when looking for efficiency? I was expecting some kind of writeString and readString, but seems there is no such thing in Java's ObjectStream.

I've been trying using buffered streams just in case there is something wrong there, explicitly saying how many bytes to write and to read, using different encodings, but no luck.

This is some sample data to reproduce the problem:

HashMap<String, Double> map = new HashMap<String, Double>();
map.put("2010-00-027.html",21732.994621513037); map.put("2010-00-020.html",3466.5169348296736); map.put("2010-00-051.html",12528.648992702407); map.put("2010-00-062.html",3354.8950010256385);
map.put("2010-00-024.html",10295.095511718278); map.put("2010-00-052.html",5381.513344679818);  map.put("2010-00-007.html",16466.33813960735);  map.put("2010-00-017.html",9484.969198176652);
map.put("2010-00-054.html",15423.873112634772); map.put("2010-00-022.html",8123.842752870753);  map.put("2010-00-033.html",21238.496665104063); map.put("2010-00-028.html",7578.792651786424);
map.put("2010-00-048.html",3566.4118233046393); map.put("2010-00-040.html",2681.0799941861724); map.put("2010-00-049.html",14308.090890746222); map.put("2010-00-058.html",5911.342406606804);
map.put("2010-00-045.html",2284.118716145881);  map.put("2010-00-031.html",2859.565771680721);  map.put("2010-00-046.html",4555.187022907964);  map.put("2010-00-036.html",8479.709295569426);
map.put("2010-00-061.html",846.8292195815125);  map.put("2010-00-023.html",14108.644025417952); map.put("2010-00-041.html",22686.232732684934); map.put("2010-00-025.html",9513.539663409734);
map.put("2010-00-012.html",459.6427911376829);  map.put("2010-00-005.html",0.0);    map.put("2010-00-013.html",2646.403220496738);  map.put("2010-00-065.html",5808.86423609936);
map.put("2010-00-056.html",12154.250518054876); map.put("2010-00-008.html",10811.15198506469);  map.put("2010-00-042.html",9271.006516004005);  map.put("2010-00-000.html",4387.4162586468965);
map.put("2010-00-059.html",4456.211623469774);  map.put("2010-00-055.html",3534.7511584735325); map.put("2010-00-057.html",8745.640098512009);  map.put("2010-00-032.html",4993.295735075575);
map.put("2010-00-021.html",3852.5805998017922); map.put("2010-00-043.html",4108.020033536286);  map.put("2010-00-053.html",2.2446400279239946); map.put("2010-00-030.html",17853.541210836203);

score 2 · Accepted Answer

ois.read(bytes); // key string bytes

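Change this to use readFully(). You are assuming that read() filled the buffer. It is not obliged to transfer more than one byte.

Is there a better way to manually serialize a String when looking for efficiency?

There is the writeUTF() and readUTF() pair.

Note that by calling getBytes() you are introducing a platform dependency. You should specify a charset there, and again when reconstructing the String.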
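
As a minimal sketch of that advice, the round trip below uses writeUTF() and readUTF(), which store a 2-byte length followed by the string in modified UTF-8, so the platform default charset is never involved. It is an illustration, not the asker's code: the class name `MapCodec` and its methods are made up here, it works on in-memory byte arrays instead of test.bin, and it assumes DataOutputStream/DataInputStream in place of the object streams from the question (ObjectOutputStream and ObjectInputStream offer the same two methods).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original question.
class MapCodec {

    // Writes the entry count, then each key via writeUTF() and each value as a double.
    static byte[] write(Map<String, Double> map) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bos)) {
            out.writeInt(map.size());
            for (Map.Entry<String, Double> entry : map.entrySet()) {
                out.writeUTF(entry.getKey());       // 2-byte length + modified UTF-8
                out.writeDouble(entry.getValue());  // 8 bytes
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Reads the map back; readUTF() and readDouble() either read the whole
    // value or throw EOFException, so there is no partial-read case to handle.
    static Map<String, Double> read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int size = in.readInt();
            Map<String, Double> map = new HashMap<>();
            for (int i = 0; i < size; i++) {
                String key = in.readUTF();
                double value = in.readDouble();
                map.put(key, value);
            }
            return map;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Double> map = new HashMap<>();
        map.put("2010-00-027.html", 21732.994621513037);
        map.put("2010-00-005.html", 0.0);
        Map<String, Double> copy = read(write(map));
        System.out.println("round trip ok: " + copy.equals(map));
    }
}
```

One limit to keep in mind: writeUTF() throws UTFDataFormatException when the encoded string is longer than 65535 bytes. For longer keys, writing an explicit length plus getBytes(StandardCharsets.UTF_8) and reading it back with readFully() is the alternative.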

score 1 · Accepted Answer

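There are two things to note here.

First, if you take out the last 4 entries of your sample data, the error does not occur. That is, the two bytes are not wrongly added. Weird.

Second, if you open the file in a hex editor and scroll down to the entry where the two extra bytes appear, you will see that it starts with a 4-byte integer with the correct value of 16 (remember, this is big-endian). Then you see the string with the two extra bytes, followed by the double associated with it.

Now, the odd part is how Java reads those bytes. First, it reads the length of the string, as you told it to. Then it tries to read 16 bytes... but here it apparently fails to read all 16, because your print statement shows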

read (2010-00-,1.3980409401811577E-76))

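Now place the cursor right after those two odd bytes and you will see this:

[Image: hex editor with the cursor placed after the two extra bytes]

From where the string starts to where the cursor now sits, it appears that only 10 bytes were read.

Also, when I tried to copy that line from my IDE console, it only pasted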

read (2010-00-

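When a string ends abruptly in a copy-paste, I usually suspect null bytes. Looking at my clipboard, sure enough, it looks like the bytes were not fully read into the buffer:

[Image: clipboard contents]

OK, so it looks like Java only managed to read 10 bytes and then moved on, which explains both the string and the number after it.

So it seems that when you call read and pass in a buffer, the buffer is not necessarily filled completely. The tooltip itself even suggests using readFully!

So, as a test, I went ahead and changed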

ois.read(bytes); // key string bytes

to

ois.readFully(bytes, 0, length); // key string bytes

For whatever reason, this works.

read (2010-00-013.html,2646.403220496738)
read (2010-00-005.html,0.0)
read (2010-00-056.html,12154.250518054876)
read (2010-00-008.html,10811.15198506469)
read (2010-00-042.html,9271.006516004005)
read (2010-00-000.html,4387.4162586468965)  // where it was failing before
read (2010-00-059.html,4456.211623469774)

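The problem

Now, the fact that it actually works is itself a problem. Why does it work? There are clearly two extra bytes in the middle of your string (making it 18 bytes long instead of 16). It is not as if the file has changed or anything.

In fact, when I manually edited the file so that it has only three entries, and declared that there are only two, this is the output I got: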

read (2010-00-056.html,12154.250518054876)
read (2010-00-wd008.ht,1.2466701288348126E219)

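This is what I would expect for an 18-byte string (well, maybe not wd; I expected w,), but you specified only 16. You have to agree that the fact that readFully actually works is strange.

So there are a few mysteries:

Why are those two extra bytes added?
Why are they not added when you remove the last 4 entries (you can add more if needed)?
Why does using readFully work, with everything else unchanged?

Unfortunately, this answer does not answer your question, and I am now just as confused, not only by what you asked but also by the behavior I am seeing.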

score 0 · Accepted Answer

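ObjectOutputStream first writes STREAM_MAGIC (0xaced), then STREAM_VERSION (5). The data then goes out in blocks: each full block is preceded by TC_BLOCKDATALONG (0x7A) and the block size (1024, as a 4-byte int), and the last block, if it is at most 255 bytes long, is preceded by TC_BLOCKDATA (0x77) and the block size (the length of that last block, as a single byte).

So when ObjectInputStream performs a readFully, it first reads the data into a buffer, skipping STREAM_MAGIC and STREAM_VERSION; then, for each block, it reads the block header to get the size and reads that many bytes of data into the buffer.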
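
The layout described above can be checked with a small sketch. It assumes the behavior of current OpenJDK implementations (block data buffered in chunks of 1024 bytes); the class `BlockLayoutDemo` and its method are made up for this illustration. The payload size of 1124 bytes is chosen because, if I have counted correctly, that is what the question's sample data produces: 4 bytes for the map size plus 40 entries of 4 + 16 + 8 bytes each.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamConstants;
import java.io.UncheckedIOException;

// Hypothetical demo class; the names are made up for this illustration.
class BlockLayoutDemo {

    // Pushes n raw bytes through an ObjectOutputStream and returns the resulting stream.
    static byte[] writeRaw(int n) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.write(new byte[n]);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] s = writeRaw(1124);

        // Stream header: STREAM_MAGIC and STREAM_VERSION, two bytes each.
        System.out.printf("header:       %02x %02x %02x %02x%n", s[0], s[1], s[2], s[3]);

        // First block header: marker byte plus a 4-byte big-endian length.
        System.out.printf("first block:  marker %02x (TC_BLOCKDATALONG is %02x), length bytes %02x %02x %02x %02x%n",
                s[4], ObjectStreamConstants.TC_BLOCKDATALONG, s[5], s[6], s[7], s[8]);

        // Second block header sits after 4 + 5 + 1024 bytes: marker byte plus a 1-byte length.
        int second = 4 + 5 + 1024;
        System.out.printf("second block: marker %02x (TC_BLOCKDATA is %02x), length %d%n",
                s[second], ObjectStreamConstants.TC_BLOCKDATA, s[second + 1]);
    }
}
```

Under these assumptions the second header is the byte pair 0x77 0x64, which is "wd" in ASCII. That matches the two stray characters in the hand-edited output shown in the earlier answer, and it would also explain why removing the last 4 entries hides the problem: 4 + 36 × 28 = 1012 bytes fit into a single block, so no header lands in the middle of a key.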

score -1 · Accepted Answer

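ObjectInputStream#read does not guarantee that it will read buffer.length bytes. When a read happens at the edge of the current read-ahead buffer block, it returns only the number of bytes remaining in that buffer. It should be written like this.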

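        int offset = 0;
        while (offset < length) {
            int cnt = ois.read(bytes, offset, length - offset); // key string bytes
            if (cnt < 0) { // end of stream: without this check the loop never terminates
                throw new java.io.EOFException("stream ended inside a key");
            }
            offset += cnt;
        }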
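
To see the short read in isolation, here is a small sketch that compares a single read() call with readFully() on the same stream. The class `ShortReadDemo` and its methods are hypothetical, and the exact number of bytes returned by read() depends on the implementation, so the code only reports it (on OpenJDK I would expect it to stop at the first block boundary).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class; the names are made up for this illustration.
class ShortReadDemo {
    static final int TOTAL = 1124; // larger than one block, so the data spans two blocks

    // Serializes TOTAL bytes with the pattern payload[i] = (byte) i.
    static byte[] serialize() {
        byte[] payload = new byte[TOTAL];
        for (int i = 0; i < TOTAL; i++) {
            payload[i] = (byte) i;
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Calls read() once with a TOTAL-sized buffer and returns how many bytes it delivered.
    static int singleRead(byte[] stream) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(stream))) {
            return ois.read(new byte[TOTAL]);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Uses readFully(), which keeps reading across blocks until the buffer is full.
    static byte[] fullRead(byte[] stream) {
        byte[] out = new byte[TOTAL];
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(stream))) {
            ois.readFully(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] stream = serialize();
        System.out.println("single read() returned " + singleRead(stream) + " of " + TOTAL + " bytes");
        System.out.println("readFully() filled all " + fullRead(stream).length + " bytes");
    }
}
```

The loop shown above and readFully() do the same job; readFully() additionally throws EOFException if the stream ends early, so it is usually the simpler choice.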
