java - UTF-8 conversion does not always work

Question

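I searched other Stack Overflow questions before posting here, but did not find anything similar. I have to scrape various UTF-8 web pages that contain text such as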

“Oggi è una bellissima giornata”

The problem is with the character “è”.

I extract this text using JTidy and an XPath query expression, and then convert it with

byte[] content = filteredEncodedString.getBytes("utf-8");
String result = new String(content,"utf-8");

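where filteredEncodedString contains the text “Oggi è una bellissima giornata”. This procedure works for most of the web pages analyzed so far, but in some cases it does not produce a correct UTF-8 string. The page encoding is always the same, and the texts are similar.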

Edit, September 14

I modified my code to fetch the pages as UTF-8:

URL url = new URL(currentUrl);
URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)");

// Decode the response with the page encoding, then re-encode the text as UTF-8.
StringBuilder domString = new StringBuilder();
try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), getEncode()))) {
    String line;
    while ((line = in.readLine()) != null) {
        // readLine() strips the line terminator, so put one back
        domString.append(line).append('\n');
    }
}

return domString.toString().getBytes("UTF-8");

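where getEncode() returns the page encoding, which is utf-8 in this case. But I still notice that ì or é are not read correctly. Is there something wrong with this code? Thanks again!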
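The question does not show how getEncode() works. As a rough sketch of one common approach, the charset can be taken from the Content-Type response header (available via URLConnection.getContentType()). The class and method names below are hypothetical, and falling back to UTF-8 is an assumption, not something the asker's code is known to do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentTypeCharset {

    // Hypothetical helper: extracts the charset from a header value such as
    // "text/html; charset=utf-8". Falls back to UTF-8 (an assumption) when
    // the header is missing or names an unknown charset.
    static Charset charsetFromContentType(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                // Case-insensitive check for the "charset=" parameter
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported charset name
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=ISO-8859-1"));
        System.out.println(charsetFromContentType("text/html"));
    }
}
```

Note that a page may also declare its encoding in a meta tag, and the header and the tag can disagree, so a header-only check like this one is not always enough.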

Edit, October 2

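This code seems to work. The problem was in the creation of the DOM document, which I did not post (sorry about that!), where the bytes returned from the method above are used.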
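The DOM-creation code was never posted, so the following is only a sketch of one way to build a DOM from UTF-8 bytes without an intermediate platform-default conversion. It uses the JDK's javax.xml.parsers API on a small well-formed snippet rather than JTidy; the class name DomFromBytes and the method parseUtf8 are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomFromBytes {

    // Hands the raw bytes to the parser and states their encoding explicitly,
    // instead of going through new String(bytes) with the platform default.
    static Document parseUtf8(byte[] utf8Bytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputSource source = new InputSource(new ByteArrayInputStream(utf8Bytes));
        source.setEncoding("UTF-8");
        return builder.parse(source);
    }

    public static void main(String[] args) throws Exception {
        // \u00e8 is "è"; the escape keeps the example independent of the source file encoding
        byte[] bytes = "<p>Oggi \u00e8 una bellissima giornata</p>".getBytes(StandardCharsets.UTF_8);
        Document doc = parseUtf8(bytes);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

The point of the sketch is that the bytes and the stated encoding travel together to the parser, so no step relies on the platform default charset.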

score 1 · Accepted Answer

You cannot “convert” a string to UTF-8 after the fact. If the bytes were decoded into characters incorrectly, you have already lost data.
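A minimal sketch of that point, assuming the wrong charset is US-ASCII (chosen only for illustration; the class name LostData and the helper decodeAs are hypothetical). Once the bytes of “è” have been decoded incorrectly, the getBytes/new String round trip from the question cannot bring the character back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LostData {

    // Decodes raw bytes using the given charset.
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // \u00e8 is "è"; the escape keeps the example independent of the source file encoding
        String original = "Oggi \u00e8 una bellissima giornata";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Wrong charset: the bytes of "è" are not valid US-ASCII, so the
        // decoder substitutes the replacement character U+FFFD for them.
        String wrong = decodeAs(utf8, StandardCharsets.US_ASCII);

        // The round trip from the question only re-encodes what is already there.
        String roundTrip = decodeAs(wrong.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);

        System.out.println(roundTrip.equals(original)); // false: the character is gone
        System.out.println(decodeAs(utf8, StandardCharsets.UTF_8).equals(original)); // true
    }
}
```

The only reliable fix is to decode the original bytes with the correct charset in the first place.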

score 0 · Accepted Answer

You can try fetching the page as a byte array rather than as a string, and then use StringUtils to convert it to a UTF-8 string.
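A sketch of this suggestion, with one substitution: the answer does not say which StringUtils it means, so the example decodes with the JDK's new String(bytes, StandardCharsets.UTF_8) instead. The class name BytesFirst and its helpers are hypothetical, and a ByteArrayInputStream stands in for conn.getInputStream().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class BytesFirst {

    // Reads the whole stream as raw bytes; no charset is involved at this stage.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Decodes exactly once, with an explicit charset.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the page body; \u00e8 is "è"
        byte[] page = "Oggi \u00e8 una bellissima giornata".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(page)) {
            System.out.println(decodeUtf8(readAll(in)));
        }
    }
}
```

Keeping the content as bytes until a single, explicit decoding step leaves no place for a platform-default charset to slip in.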
