java - HttpClient downloads a txt file with corrupted characters

Question

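I am trying to pull some txt files from a server, and the files are supposed to be in UTF-8. My code downloads the file, but the result contains some strange characters: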

Sydney�s Desalination Plant

If I download it directly with Chrome, it displays correctly as:

Sydney’s Desalination Plant

Here is my current code:

public String getURL(String url) throws Exception
{
    StringBuffer result=new StringBuffer();
    if(StringUtils.isNotBlank(url) && url.startsWith("http"))
    {
        HttpClient client = new DefaultHttpClient();
        client.getParams().setParameter("http.protocol.content-charset", "UTF-8");
        HttpGet request = new HttpGet(url);

        // add request header
        //request.addHeader("User-Agent", "");
        //request.addHeader(Content-Type: text/html; charset=UTF-8)
        HttpResponse response = client.execute(request);

        System.out.println("Response Code : " + response.getStatusLine().getStatusCode());
        if(response.getStatusLine().getStatusCode() == 200)
        {

            //System.out.println(response.getEntity().getContentType().getValue());
            BufferedReader rd = new BufferedReader(
                new InputStreamReader(response.getEntity().getContent(),"UTF-8"));
            //result=(EntityUtils.getContentCharSet(response.getEntity()));
            boolean flagIn = false;
            String sCurrentLine;
            while ((sCurrentLine = rd.readLine()) != null) 
            {
                //if(flagIn==false)
                //{
                //  sCurrentLine = removeUTF8BOM(sCurrentLine);
                //}

                if(flagIn)
                {
                    result.append("\n");
                }   
                result.append(sCurrentLine);

                flagIn = true;
            }

        }
    }
    return result.toString();

}

And this is how I call the method:

System.out.println(former.getURL("http://photos.gcdis-india.com/png/bio/QSPNGC1002.txt"));

Any idea which part I should fix? Do I need to send any special HTTP headers? Or is the reader the problem here?

score 4 · Accepted Answer

OK, here's the deal. After trying your code with your URL, this is what I can tell you.

First, don't assume you have UTF-8. Always use whatever character encoding the HTTP response header declares.

In your case the response header does not declare an encoding at all, so you have to fall back to some default. This is where things get iffy.

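Many sources suggest falling back to windows-1252, which decodes the apostrophe correctly. The default for text/html is iso-8859-1 ( http://www.w3.org/International/O-HTTP-charset ), but iso-8859-1 does not decode that character correctly.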
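To see why the choice of fallback matters, here is a minimal sketch. It assumes (the response itself does not confirm this) that the apostrophe is stored as the single byte 0x92, which is the right single quotation mark in windows-1252. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;

class ApostropheDecodingDemo {
    // Decode the same raw bytes with the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Assumption: the file holds "Sydney", then byte 0x92, then "s".
        byte[] raw = {'S', 'y', 'd', 'n', 'e', 'y', (byte) 0x92, 's'};

        for (String name : new String[] {"windows-1252", "ISO-8859-1", "UTF-8"}) {
            String text = decode(raw, name);
            // Index 6 is the character produced from byte 0x92.
            System.out.println(name + " -> " + String.format("U+%04X", (int) text.charAt(6)));
        }
        // windows-1252 maps 0x92 to U+2019 (right single quotation mark).
        // ISO-8859-1 maps it to U+0092, an invisible control character.
        // UTF-8 treats a lone 0x92 as malformed and substitutes U+FFFD.
    }
}
```

Under that assumption, forcing "UTF-8" on the reader, as the question's code does, is what turns the apostrophe into a replacement character.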

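I can't find any hard reference saying that windows-1252 should be the default for text/plain. However, nearly every example of a text/plain request that I can find defaults to that encoding, so I can only conclude that it is generally a safe fallback.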

So I would say:

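Get the charset from the response header (or from your entity).
If there is none and your content type is text/plain, default to windows-1252. If your content type is text/html, default to iso-8859-1 (EDIT: or, if you want to be more robust, first decode the content as us-ascii, look for the character encoding in the HTML meta tag, and then decode with that, otherwise iso-8859-1).
Pass that charset to the InputStreamReader. Don't assume utf-8.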
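The steps above could be sketched roughly as follows, using only the JDK. The class name ResponseCharset and its methods are hypothetical helpers, not part of HttpClient. Falling back to windows-1252 for anything that is not text/html (including a missing Content-Type) is an assumption of this sketch; the advice above only covers text/plain and text/html.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ResponseCharset {
    // Pick a charset from a raw Content-Type header value, which may be null.
    static Charset choose(String contentType) {
        String mediaType = "";
        if (contentType != null) {
            String[] parts = contentType.split(";");
            mediaType = parts[0].trim().toLowerCase(Locale.ROOT);
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = param.substring("charset=".length()).replace("\"", "").trim();
                    try {
                        return Charset.forName(name); // step 1: the server declared a charset
                    } catch (IllegalArgumentException e) {
                        // Unknown or malformed charset name: fall through to the defaults.
                    }
                }
            }
        }
        // Step 2: nothing usable was declared, so fall back by media type.
        // Assumption: everything that is not text/html is treated like text/plain.
        return mediaType.equals("text/html")
                ? StandardCharsets.ISO_8859_1
                : Charset.forName("windows-1252");
    }

    // Step 3: decode the body with the chosen charset instead of a hard-coded one.
    static String decode(byte[] body, String contentType) {
        return new String(body, choose(contentType));
    }
}
```

In the question's code, this would mean passing the chosen charset to the InputStreamReader in place of the literal "UTF-8". The header value is what the commented-out response.getEntity().getContentType().getValue() line reads; note that getContentType() can return null when the server sends no Content-Type.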

Everything I have found so far suggests that the above covers most cases. I will keep looking around for a definitive source.
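For the more robust text/html variant (decode as us-ascii, look for a charset in a meta tag, otherwise iso-8859-1), a rough sketch might look like this. The class name, the regular expression, and the 1024-byte scan window are all assumptions made for illustration. This is a simplified check, not a full HTML encoding sniffer.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlMetaCharset {
    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    static Charset sniff(byte[] body) {
        // Only the start of the document is scanned (assumed window: 1024 bytes).
        int length = Math.min(body.length, 1024);
        // us-ascii is enough to read the markup; non-ASCII bytes become U+FFFD.
        String head = new String(body, 0, length, StandardCharsets.US_ASCII);
        Matcher matcher = META_CHARSET.matcher(head);
        if (matcher.find()) {
            try {
                return Charset.forName(matcher.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown charset name in the meta tag: use the default below.
            }
        }
        return StandardCharsets.ISO_8859_1;
    }
}
```

The body would then be decoded a second time with whatever sniff returns, so this approach needs the raw bytes buffered rather than read once through a stream.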
