I received this string through a message broker (Stomp):
João
This is what it should look like:
João
Is there a way to recover it in Java? Thanks!
Code point  Char  UTF-8 bytes  Name
U+00C3      Ã     c3 83        LATIN CAPITAL LETTER A WITH TILDE
U+00C2      Â     c3 82        LATIN CAPITAL LETTER A WITH CIRCUMFLEX
U+00A3      £     c2 a3        POUND SIGN
U+00E3      ã     c3 a3        LATIN SMALL LETTER A WITH TILDE
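The byte values listed for these characters can be checked with a few lines of Java. This is only a minimal sketch; the class name Utf8Bytes and the helper utf8Hex are made up for illustration and do not come from the thread.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Bytes {
    /** Returns the UTF-8 encoding of s as space-separated lowercase hex bytes. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source independent of the file's own encoding.
        String[] samples = {"\u00c3", "\u00c2", "\u00a3", "\u00e3"};
        for (String s : samples) {
            System.out.printf("U+%04X %s%n", (int) s.charAt(0), utf8Hex(s));
        }
    }
}
```

Each of these characters takes two bytes in UTF-8, which is why text that is decoded with a single-byte charset grows by one character per accented letter.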
I'm not convinced this is a data (encoding) conversion problem. Is it possible that the data is simply bad?
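If the data is not bad, then we have to assume the encoding was misinterpreted somewhere. We don't know the original encoding, and unless you are doing something different, Java holds strings internally as UTF-16 (the charset used when converting to and from bytes is a separate, platform-dependent default). I can't see how João, encoded in any common encoding, could end up being interpreted as João in a UTF-16 string.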
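To keep the in-memory representation apart from the byte encodings involved, here is a small sketch that counts UTF-16 code units and encoded bytes for the expected string. The class StringUnits and its helper byteCount are illustrative names, not part of the original answer.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StringUnits {
    /** Number of bytes s occupies once encoded with the given charset. */
    static int byteCount(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String name = "Jo\u00e3o"; // the expected string, written with an escape

        // length() counts UTF-16 code units, not bytes.
        System.out.println("code units: " + name.length());
        System.out.println("UTF-8:      " + byteCount(name, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + byteCount(name, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-16BE:   " + byteCount(name, StandardCharsets.UTF_16BE));
    }
}
```

The byte encoding only matters at the boundary where bytes become a String or the reverse, so that boundary is where a mismatch would have to occur.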
To be sure, I ran this Python script, and it found no match. I'm not entirely sure it covers every encoding, or that I haven't missed a corner case, FWIW.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import encodings
import pkgutil
import sys

good = 'João'
bad = 'João'

# "aliases" is a helper module inside the encodings package, not a codec.
false_positives = {"aliases"}
found = {name for _, name, ispkg in pkgutil.iter_modules(encodings.__path__) if not ispkg}
found.difference_update(false_positives)
print(found)

# Encode with every codec x, decode with every codec y, and look for the bad string.
for x in found:
    for y in found:
        res = None
        try:
            res = good.encode(x).decode(y)
            print(res, x, y)
        except Exception:
            # Incompatible pair, or a codec that cannot handle this text.
            pass
        if res == bad:
            print("FOUND")
            sys.exit(1)
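Since the question is about Java, the same brute-force search can be sketched against the charsets the JVM ships with. This is an approximation of the script above, not a port of it: the class CharsetPairSearch and the method findPairs are hypothetical names, the set of charsets differs from Python's codec list, and String.getBytes silently substitutes characters a charset cannot encode instead of raising an error.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

final class CharsetPairSearch {
    /** Lists "encoder->decoder" pairs that turn good into bad in a single round trip. */
    static List<String> findPairs(String good, String bad) {
        Collection<Charset> charsets = Charset.availableCharsets().values();
        List<String> hits = new ArrayList<>();
        for (Charset enc : charsets) {
            if (!enc.canEncode()) {
                continue; // some charsets are decode-only
            }
            byte[] bytes;
            try {
                bytes = good.getBytes(enc);
            } catch (RuntimeException e) {
                continue; // skip charsets that refuse to encode
            }
            for (Charset dec : charsets) {
                try {
                    if (new String(bytes, dec).equals(bad)) {
                        hits.add(enc.name() + "->" + dec.name());
                    }
                } catch (RuntimeException e) {
                    // skip charsets that refuse to decode
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String good = "Jo\u00e3o";
        // Sanity check with a string known to come from a UTF-8/ISO-8859-1 mix-up.
        System.out.println(findPairs(good, "Jo\u00c3\u00a3o"));
        // The string from the question: J o U+00C3 U+00C2 U+00A3 o.
        System.out.println(findPairs(good, "Jo\u00c3\u00c2\u00a3o"));
    }
}
```

The first call serves as a control: if it does not report the UTF-8 to ISO-8859-1 pair, the search itself is broken and an empty result for the second call would mean nothing.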
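In some cases a hack works. But the best approach is to prevent the problem from happening in the first place.
I ran into this with a servlet that printed the correct headers, HTTP content type, and encoding on the page, yet IE submitted forms encoded as latin1 instead of the declared encoding. I wrote a quick and dirty hack (a request wrapper that detects whether the client really is IE and converts the data if so) to fix it for new data, and that worked well. For the data already corrupted in the database, I used the following hack.
Unfortunately, my hack does not work on your sample string, although it looks very close: compared with the broken string I reproduced from my "theoretical cause", yours has just one extra character (Â). So my guess of "latin1" may be wrong, and you should try other encodings (such as those in the other link Tomas posted).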
package peter.test;

import java.io.UnsupportedEncodingException;

/**
 * User: peter
 * Date: 2012-04-12
 * Time: 11:02 AM
 */
public class TestEncoding {
    public static void main(String[] args) throws UnsupportedEncodingException {
        //In some cases a hack works. But best is to prevent it from ever happening.
        String good = "João";
        String bad = "João";

        //this line demonstrates what the "broken" string should look like if it is reversible.
        String broken = breakString(good, bad);

        //here we show that it is fixable if broken like breakString() does it.
        fixString(good, broken);

        //this line attempts to fix the string, but it is not fixable unless broken in the same way as breakString()
        fixString(good, bad);
    }

    private static String fixString(String good, String bad) throws UnsupportedEncodingException {
        //read the Java chars as if they were latin1 (if this works, it should result in the same
        //number of bytes as Java characters; if using UTF8, it would be more bytes)
        byte[] bytes = bad.getBytes("latin1");
        //take the raw bytes, and try to convert them to a string as if they were UTF8
        String fixed = new String(bytes, "UTF8");

        System.out.println("Good: " + good);
        System.out.println("Bad: " + bad);
        System.out.println("bytes1.length: " + bytes.length);
        System.out.println("fixed: " + fixed);
        System.out.println();
        return fixed;
    }

    private static String breakString(String good, String bad) throws UnsupportedEncodingException {
        byte[] bytes = good.getBytes("UTF8");
        String broken = new String(bytes, "latin1");

        System.out.println("Good: " + good);
        System.out.println("Bad: " + bad);
        System.out.println("bytes1.length: " + bytes.length);
        System.out.println("broken: " + broken);
        System.out.println();
        return broken;
    }
}
Result (using Sun JDK 1.7.0_03):
Good: João
Bad: João
bytes1.length: 5
broken: JoÃ£o

Good: João
Bad: JoÃ£o
bytes1.length: 5
fixed: João

Good: João
Bad: João
bytes1.length: 6
fixed: Jo�£o
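The last attempt fails because the latin1 bytes of the sample string (c3 c2 a3) are not valid UTF-8. One possible explanation, which is only a hypothesis and is not confirmed anywhere in this thread, is that the text was mis-decoded twice and that a C1 control character (U+0083) was dropped along the way. The sketch below reproduces that scenario and uses a strict decoder so that an unrepairable string is reported instead of silently producing a replacement character. The class DoubleMojibake and its methods are illustrative names only.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class DoubleMojibake {
    /** One round of damage: encode as UTF-8, then misread the bytes as ISO-8859-1. */
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Undoes one round of misdecode, or returns null if s cannot be the result of one. */
    static String undoOnce(String s) {
        try {
            ByteBuffer bytes = StandardCharsets.ISO_8859_1.newEncoder()
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(bytes)
                    .toString();
        } catch (CharacterCodingException e) {
            return null; // not latin1-encodable, or the bytes are not valid UTF-8
        }
    }

    /** Removes C1 control characters (U+0080..U+009F), simulating a tool that drops them. */
    static String stripC1(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x80 || c > 0x9f) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String good = "Jo\u00e3o";            // expected string
        String bad = "Jo\u00c3\u00c2\u00a3o"; // string from the question
        String twice = misdecode(misdecode(good));

        // Does double damage plus a lost control character give the question's string?
        System.out.println(stripC1(twice).equals(bad));
        // With the control character intact, two strict passes recover the original.
        System.out.println(good.equals(undoOnce(undoOnce(twice))));
        // With it missing, the strict pass refuses instead of guessing.
        System.out.println(undoOnce(bad));
    }
}
```

If this hypothesis holds, the string as received is missing a byte and cannot be repaired by charset conversion alone; the practical fix would be to find the two places where UTF-8 bytes are read as latin1 and correct them at the source.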