
I know this question has been asked many times, but I am stuck on it and nothing I have read has helped.

I have this code:

BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String line;
while((line = reader.readLine()) != null)content += line+"\r\n";
reader.close();

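I am trying to fetch the content of this page, http://www.garazh.com.ua/tires/catalog/Marangoni/E-COMM/description/, and all non-Latin characters are displayed incorrectly.

I tried setting the encoding, like this: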

BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream(), "WINDOWS-1251"));

With that, everything works fine! But I can't hard-code the encoding for every site I try to parse, so I need a general solution.
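One way to keep the charset decision separate from the download is to read the response as raw bytes first and decode it only once the encoding is known. The following is a minimal sketch under that assumption; the class name `RawBody` and its methods are hypothetical helpers, not part of the code in this question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RawBody {
    /** Reads the whole stream into memory without interpreting it as text. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Decodes previously downloaded bytes once the charset has been decided. */
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for conn.getInputStream(): a few Cyrillic letters encoded as UTF-8.
        byte[] sample = "\u0428\u0438\u043d\u044b".getBytes(StandardCharsets.UTF_8);
        byte[] body = readAll(new ByteArrayInputStream(sample));
        System.out.println(decode(body, StandardCharsets.UTF_8));
    }
}
```

Because the bytes are kept, the same body can be decoded again with a different charset if the first guess turns out to be wrong, without requesting the page a second time.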

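So guys, I know that detecting an encoding is not as easy as it seems, but I really need it. If anyone has had this problem, please explain how you solved it!

Any help appreciated!

Here is the full code of the function I use to fetch the content: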

protected Map<String, String> getFromUrl(String url){
    Map<String, String> mp = new HashMap<String, String>();
    String newCookie = "", redirect = null;
    try{
        String host = this.getHostName(url), content = "", header = "", UA = this.getUA(), cookie = this.getCookie(host, UA), referer = "http://"+host+"/";
        URL U = new URL(url);
        URLConnection conn = U.openConnection();
        conn.setRequestProperty("Host", host);
        conn.setRequestProperty("User-Agent", UA);
        conn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        conn.setRequestProperty("Accept-Language", "ru-ru,ru;q=0.8,en-us;q=0.5,en;q=0.3");
        conn.setRequestProperty("Accept-Encoding", "gzip,deflate");
        conn.setRequestProperty("Accept-Charset", "utf-8;q=0.7,*;q=0.7");
        conn.setRequestProperty("Keep-Alive", "115");
        conn.setRequestProperty("Connection", "keep-alive");
        if(referer != null)conn.setRequestProperty("Referer", referer);
        if(cookie != null && !cookie.contentEquals(""))conn.setRequestProperty("Cookie", cookie);
        for(int i=0; ; i++){
            String name = conn.getHeaderFieldKey(i);
            String value = conn.getHeaderField(i);
            if(name == null && value == null)break;
            if(name != null){
                if(name.contentEquals("Set-Cookie"))newCookie += value + " ";
                else if(name.toLowerCase().trim().contentEquals("location"))redirect = value;
            }
            header += name + ": " + value + "\r\n";
        }
        if(!newCookie.contentEquals("") && !newCookie.contentEquals(cookie))this.setCookie(host, UA, newCookie.trim());
        try{
            BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            while((line = reader.readLine()) != null)content += line+"\r\n";
            reader.close();
        }
        catch(Exception e){/*System.out.println(url+"\r\n"+e);*/}
        mp.put("url", url);
        mp.put("header", header);
        mp.put("content", content);
    }
    catch(Exception e){
        mp.put("url", "");
        mp.put("header", "");
        mp.put("content", "");
    }
    if(redirect != null && this.redirectCount < 3){
        // Count the hop before recursing; otherwise a redirect loop recurses without bound.
        this.redirectCount++;
        mp = getFromUrl(redirect);
    }
    return mp;
}

2 Answers


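Take jsoup as an example. Detecting the character encoding of an arbitrary website is a complicated problem, because the relevant header may or may not be present and there are two different meta tags that can declare it. For example, the page you linked does not send a charset in its Content-Type header.

And you need an HTML parser anyway; you weren't thinking of using regular expressions, were you?

Here is an example usage: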

Connection connection = Jsoup.connect("http://www.garazh.com.ua/tires/catalog/Marangoni/E-COMM/description/");
connection
    .header("Host", host)
    .header("User-Agent", UA)
    .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
    .header("Accept-Language", "ru-ru,ru;q=0.8,en-us;q=0.5,en;q=0.3")
    .header("Accept-Encoding", "gzip,deflate")
    .header("Accept-Charset", "utf-8;q=0.7,*;q=0.7")
    .header("Keep-Alive", "115")
    .header("Connection", "keep-alive");

connection.followRedirects(true);

Document doc = connection.get();

Map<String, String> cookies = connection.response().cookies();

Elements titles = doc.select(".title");
for( Element title : titles ) {
    System.out.println(title.ownText());
}

Output:

Шины Marangoni E-COMM
Описание шины Marangoni E-COMM
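To illustrate why the meta tags matter when the Content-Type header carries no charset, here is a rough sketch of sniffing both meta forms (`<meta charset=...>` and `<meta http-equiv="Content-Type" content="...charset=...">`) from the first bytes of a page. The class `MetaCharsetSniffer`, its method, and the 4096-byte window are assumptions made for this illustration; it is not how any particular library implements detection, and a real parser handles far more edge cases than this regex does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetSniffer {
    // Covers both <meta charset="..."> and the http-equiv form, where
    // charset=... appears inside the content attribute.
    private static final Pattern META = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([^\\s;\"'/>]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in a meta tag, or null if none is found or it is unknown. */
    static Charset fromMeta(byte[] body) {
        // Only the start of the document is inspected. ISO-8859-1 maps every
        // byte to one char, so the ASCII markup survives whatever the real encoding is.
        int len = Math.min(body.length, 4096);
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META.matcher(head);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name in the page.
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(fromMeta(page)); // prints UTF-8
    }
}
```

A caller would try the header first, fall back to `fromMeta`, and only then to a default, which is the kind of layered lookup that makes a ready-made HTML library attractive.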
answered 2013-03-29T11:42:46.803

You are looking for the "Content-Type" header:

Content-Type: text/html; charset=utf-8

The "charset" part there is what you are looking for.
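As a minimal sketch of pulling that part out in Java, the helper below parses the charset parameter from a Content-Type value and falls back to a caller-supplied default when the parameter is missing or names an unknown charset. The class and method names are hypothetical; with the question's code, the input would presumably come from `conn.getContentType()`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentTypeCharset {
    // Matches the charset parameter, quoted or unquoted, case-insensitively.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or the fallback if absent or unknown. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(fromContentType("text/html; charset=utf-8", fallback)); // prints UTF-8
        System.out.println(fromContentType("text/html", fallback));                // prints ISO-8859-1
    }
}
```

The returned `Charset` can be passed straight to `new InputStreamReader(stream, charset)`. Note that this only helps when the server actually sends the parameter, so a fallback is still needed for pages that omit it.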

answered 2013-03-29T01:51:49.280