character-encoding - String constructor behaves differently between operating systems

Question

I have the following code:

byte[] b = new byte[len]; //len is preset to 157004 in this example
//fill b with data by reading from a socket
String pkt = new String(b);
System.out.println(b.length + " " + pkt.length());

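On Ubuntu this prints two different values, 157004 and 147549, but on OS X the two values are the same. The string is actually an image being transferred by the ImageIO library. As a result, on OS X I can decode the string into an image just fine, but on Ubuntu I can't.

I am using version 1.6.0_45 on OS X, and I tried the same version on Ubuntu, as well as Oracle JDK 7 and the default OpenJDK.

I noticed that I can make the string length equal to the byte array length by decoding as Latin-1: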

String pkt = new String(b,"ISO-8859-1");

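However, the image still cannot be decoded, and it is hard to work out what is going on because the string looks like garbage to me.

I'm puzzled by the fact that I'm using the same JDK version and only the operating system differs.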

score 7 · Accepted Answer

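The string is actually an image being transferred by the ImageIO library.

That is where you are going wrong.

An image isn't text data - it's binary data. If you really need to encode it as a string, you should use base64. Personally I like the public domain base64 encoder/decoder at iharder.net.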

This isn't just true for images - it's true for all binary data which isn't known to be text in a particular encoding... whether that's sound, movies, Word documents, encrypted data etc. Never just treat it as if it were just encoded text - it's a recipe for disaster.
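As a rough illustration of the base64 suggestion, here is a minimal sketch using `java.util.Base64`. That class only exists from Java 8 onwards, so it is an assumption here: on the JDK 6/7 setups mentioned in the question a third-party encoder such as the iharder.net one would play the same role. The class name `BinaryAsText` and the sample bytes are made up for the example.

```java
import java.util.Arrays;
import java.util.Base64;

public class BinaryAsText {
    // Encode arbitrary bytes as an ASCII-only string; no charset decoding is involved.
    public static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Recover the exact original bytes from the Base64 text.
    public static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for image bytes read from a socket.
        byte[] data = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0, 0x00, 0x10};

        String text = encode(data);
        byte[] back = decode(text);

        System.out.println(text);                      // safe to send as text
        System.out.println(Arrays.equals(data, back)); // true: bytes survive the round trip
    }
}
```

If the data never actually has to be a `String`, keeping it as a `byte[]` end to end avoids the encoding step altogether.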

score 0 · Accepted Answer

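new String(byte[]) decodes with the platform default charset, and Ubuntu uses UTF-8 by default. UTF-8 is a variable-length encoding, so several bytes can decode to a single character and the string length ends up different from the byte array length. That is the source of the difference, but for the solution I defer to Jon's answer.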
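To make the length difference concrete, here is a small sketch that decodes the same bytes with an explicit charset instead of the platform default. The class name `DecodeLengths` and the sample bytes are invented for illustration; `StandardCharsets` needs Java 7 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DecodeLengths {
    // Number of chars produced when the bytes are decoded as UTF-8.
    public static int utf8Length(byte[] b) {
        return new String(b, StandardCharsets.UTF_8).length();
    }

    // Number of chars produced when the bytes are decoded as ISO-8859-1 (one char per byte).
    public static int latin1Length(byte[] b) {
        return new String(b, StandardCharsets.ISO_8859_1).length();
    }

    // True if decoding as UTF-8 and encoding again gives back the same bytes.
    public static boolean survivesUtf8RoundTrip(byte[] b) {
        String s = new String(b, StandardCharsets.UTF_8);
        return Arrays.equals(s.getBytes(StandardCharsets.UTF_8), b);
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is one character in UTF-8, followed by 'A'.
        byte[] text = {(byte) 0xC3, (byte) 0xA9, 0x41};
        System.out.println(utf8Length(text));   // 2
        System.out.println(latin1Length(text)); // 3

        // 0xFF can never appear in valid UTF-8, so it is replaced while decoding.
        byte[] binary = {(byte) 0xFF, 0x41};
        System.out.println(survivesUtf8RoundTrip(binary)); // false: the original byte is lost
    }
}
```

The last case is why binary data such as an image can be silently damaged by a UTF-8 decode: bytes that are not valid UTF-8 are replaced rather than preserved.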
