5

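I read that Java uses UTF-16 encoding internally. That is, I understand that if I write: String var = "जनमत"; then "जनमत" will be encoded in UTF-16 internally. So, if I dump this variable to some file as shown below: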

FileOutputStream fileOut = new FileOutputStream("output.xyz");
ObjectOutputStream out = new ObjectOutputStream(fileOut);
out.writeObject(var);
out.close();

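will the encoding of the string "जनमत" in the file "output.xyz" be UTF-16? Also, later on, if I want to read from the file "output.xyz" via ObjectInputStream, will I be able to get the UTF-16 representation of the variable?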

Thanks.

4

3 Answers

7

So, If I dump this variable to some file... will the encoding of the string "जनमत" in the file "output.xyz" be in UTF-16?

The encoding of your string in the file will be in whatever format the ObjectOutputStream wants to put it in. You should treat it as a black box that can only be read by an ObjectInputStream. (Seriously - even though the format is IIRC well-documented, if you want to read it with some other tool, you should serialise the object yourself as XML or JSON or whatever.)

Later on if I want to read from the file "output.xyz" via ObjectInputStream, will I be able to get the UTF-16 representation of the variable?

If you read the file with an ObjectInputStream, you'll get a copy of the original object back. This will include a java.lang.String, which is just a sequence of characters (not bytes), from which you could get the UTF-16 representation if you wished via getBytes("UTF-16") (though I suspect you don't actually need to). Note that getBytes() with no argument uses the platform default charset, which is not necessarily UTF-16.
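
To make the round trip concrete, here is a minimal sketch that serializes the string to an in-memory buffer instead of "output.xyz" and reads it back. The class and method names (RoundTripDemo, serialize, deserialize) are made up for illustration, and the string is spelled with Unicode escapes only so the example does not depend on the source file's encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {

    // Serializes a String with ObjectOutputStream and returns the raw stream bytes.
    public static byte[] serialize(String value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the String back with ObjectInputStream.
    public static String deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (String) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u091C\u0928\u092E\u0924"; // "जनमत"
        String copy = deserialize(serialize(original));
        System.out.println(original.equals(copy)); // true

        // UTF_16BE yields two bytes per char and no byte-order mark.
        byte[] utf16 = copy.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(utf16.length); // 8
    }
}
```

The UTF-16 bytes come from the String after deserialization, not from the file: whatever encoding the stream uses on disk does not affect what getBytes returns.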


In conclusion, don't worry too much about the internal details of serialization. If you need to know what's going on, create the file yourself; and if you're just curious, trust in the JVM to do the right thing.
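
If the file really does need to be UTF-16, "create the file yourself" could look something like the sketch below, which encodes the string explicitly instead of going through serialization. The names (ExplicitEncodingDemo, writeUtf16, readUtf16, tempFile) are hypothetical, and a temporary file stands in for "output.xyz".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitEncodingDemo {

    // Creates a scratch file so the example does not touch the working directory.
    public static Path tempFile() {
        try {
            return Files.createTempFile("output", ".xyz");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the string as UTF-16 and returns the bytes that ended up in the file.
    public static byte[] writeUtf16(Path file, String value) {
        try {
            Files.write(file, value.getBytes(StandardCharsets.UTF_16));
            return Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes the file contents as UTF-16; the decoder consumes the byte-order mark.
    public static String readUtf16(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_16);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = tempFile();
        String original = "\u091C\u0928\u092E\u0924"; // "जनमत"
        byte[] onDisk = writeUtf16(file, original);
        // 2 bytes of byte-order mark (FE FF) plus 2 bytes for each of the 4 chars.
        System.out.println(onDisk.length); // 10
        System.out.println(original.equals(readUtf16(file))); // true
    }
}
```

A file written this way can be opened by any tool that understands UTF-16, which is not true of an ObjectOutputStream stream.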

Answered 2010-12-08T17:41:52.373
1

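Close: a Java String is a sequence of 16-bit char values, i.e. UTF-16 code units. It is not validated, so unpaired surrogates are allowed and in that sense it behaves a bit like UCS-2. Either way, it uses 2 bytes for most characters, and a sequence of 2 chars (a surrogate pair, i.e. 4 bytes) for the rarely used code points outside the Basic Multilingual Plane.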

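ObjectOutputStream uses something called modified UTF-8, which is like UTF-8 except that the zero character (U+0000) is written as the 2-byte sequence 0xC0 0x80. That sequence is not legal in standard UTF-8 (overlong forms are forbidden so that every code point has a unique encoding), but it naturally decodes back to the value 0. Supplementary characters also differ: each half of the surrogate pair is written as its own 3-byte sequence rather than as one 4-byte sequence.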
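
The differences from standard UTF-8 are easy to see by comparing bytes. The sketch below uses DataOutputStream.writeUTF, which writes a 2-byte length followed by modified UTF-8, and strips that length prefix; the class and helper names (ModifiedUtf8Demo, modifiedUtf8, hex) are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {

    // Encodes with DataOutputStream.writeUTF and drops the 2-byte length prefix.
    public static byte[] modifiedUtf8(String value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Formats bytes as space-separated upper-case hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8))); // 00
        System.out.println(hex(modifiedUtf8(nul)));                    // C0 80

        // Characters in the Basic Multilingual Plane encode identically in both.
        String word = "\u091C\u0928\u092E\u0924"; // "जनमत"
        System.out.println(Arrays.equals(
                word.getBytes(StandardCharsets.UTF_8), modifiedUtf8(word))); // true

        // A supplementary character (U+1F600) is 4 bytes in UTF-8, 6 in modified UTF-8.
        String emoji = "\uD83D\uDE00";
        System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length); // 4
        System.out.println(modifiedUtf8(emoji).length);                    // 6
    }
}
```

So for the string in the question, the bytes in the serialized stream happen to match plain UTF-8, 3 bytes per character.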

But what you are really asking is "does it work so that I write a String and read a String back?", and the answer to that is yes. The JDK does the proper encoding when writing bytes out, and the proper decoding when reading them in.

For what it's worth, you may be better off using the writeUTF() method for Strings, since I think the resulting output is a bit more compact. writeObject() also works; it just needs a bit more metadata.
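
Rather than take the size claim on faith, both paths can be measured. This is a small sketch under the assumption that "writeUTF()" means ObjectOutputStream.writeUTF; the class and method names are hypothetical, and the sizes are printed rather than stated here because they depend on how the stream frames each kind of data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

public class WriteUtfDemo {

    // Writes the string as a serialized object.
    public static byte[] viaWriteObject(String value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Writes the string as primitive data via writeUTF.
    public static byte[] viaWriteUTF(String value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeUTF(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Data written with writeUTF must be read with readUTF, not readObject.
    public static String readBackUTF(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String word = "\u091C\u0928\u092E\u0924"; // "जनमत"
        System.out.println("writeObject: " + viaWriteObject(word).length + " bytes");
        System.out.println("writeUTF:    " + viaWriteUTF(word).length + " bytes");
        System.out.println(word.equals(readBackUTF(viaWriteUTF(word)))); // true
    }
}
```

One practical difference: writeUTF rejects strings whose encoded form exceeds 65535 bytes with a UTFDataFormatException, whereas writeObject switches to the long string format.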

Answered 2010-12-08T17:40:58.037
0

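Just to add: ObjectOutputStream.writeString() determines the UTF length of the given string and writes it in either "standard" UTF or "long" UTF format, where "long" is as described in the javadoc:

"Long" UTF format is identical to standard UTF, except that it uses an 8 byte header (instead of the standard 2 bytes) to convey the UTF encoding length.

I got this from the code...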

private void writeString(String str, boolean unshared) throws IOException {
    handles.assign(unshared ? null : str);
    long utflen = bout.getUTFLength(str);
    if (utflen <= 0xFFFF) {
        bout.writeByte(TC_STRING);
        bout.writeUTF(str, utflen);
    } else {
        bout.writeByte(TC_LONGSTRING);
        bout.writeLongUTF(str, utflen);
    }
}

and in writeObject(Object obj) they do this check:

if (obj instanceof String) {
    writeString((String) obj, unshared);
}
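
The switch between the two formats can be observed from outside by looking at the type code that follows the 4-byte stream header. A minimal sketch, with made-up names (LongStringDemo, typeCode, repeat); the TC_STRING and TC_LONGSTRING constants come from java.io.ObjectStreamConstants.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamConstants;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class LongStringDemo {

    // Serializes the string and returns the type code byte that follows the header.
    public static byte typeCode(String value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Bytes 0-3 are the stream magic number and version; byte 4 is the type code.
        return buffer.toByteArray()[4];
    }

    // Builds a string of the given length made of 'a', which encodes as 1 byte each.
    public static String repeat(int length) {
        char[] chars = new char[length];
        Arrays.fill(chars, 'a');
        return new String(chars);
    }

    public static void main(String[] args) {
        // 0xFFFF encoded bytes still fits the 2-byte length of the standard format.
        System.out.println(typeCode(repeat(0xFFFF)) == ObjectStreamConstants.TC_STRING);      // true
        // One byte more and the stream uses the 8-byte length of the long format.
        System.out.println(typeCode(repeat(0x10000)) == ObjectStreamConstants.TC_LONGSTRING); // true
    }
}
```

Note that the threshold applies to the encoded length, not the char count, so a string of about 21846 Devanagari characters (3 bytes each) already crosses it.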
Answered 2010-12-08T17:51:55.450