2 Answers

After further inspection, I found that my web page loads in two increments, presumably because of the sheer volume of data. The last entry in my jsoup array of <a> tags corresponded to the last <a> in the first increment of the page.

I got around this by pulling the HTML down separately with this method:

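import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Downloads the complete response body at the given URL and returns it as one string.
private static String getHtml(String location) throws IOException {
    URL url = new URL(location);
    URLConnection conn = url.openConnection();
    StringBuilder builder = new StringBuilder();
    // try-with-resources closes the reader (and the underlying stream) even on failure.
    // UTF-8 is assumed here; use the page's actual charset if it differs.
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
        String input;
        while ((input = in.readLine()) != null) {
            // readLine() strips the line terminator, so put one back.
            builder.append(input).append('\n');
        }
    }
    return builder.toString();
}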

I then call Jsoup.parse on the resulting string. This gave me all the data, and it actually improved performance (although for the life of me I don't know how).
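To check whether links are being lost in the download or in the parser, one rough approach is to count the opening <a> tags in the raw string and compare that number with what jsoup's select("a") returns on the parsed document. The following is only a minimal sketch: the class name AnchorCount and the method countAnchors are hypothetical, and the regex is a crude counter rather than an HTML parser (it will also count "<a " inside comments or scripts).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup or of the original answer.
final class AnchorCount {
    // Matches the start of an opening anchor tag: "<a" followed by whitespace or ">".
    // Case-insensitive so that "<A HREF=...>" is counted as well.
    private static final Pattern ANCHOR_OPEN =
            Pattern.compile("<a(\\s|>)", Pattern.CASE_INSENSITIVE);

    // Returns the number of opening anchor tags found in the raw HTML text.
    static int countAnchors(String html) {
        Matcher m = ANCHOR_OPEN.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Two anchors; <abbr> and the closing </a> tags must not be counted.
        String sample = "<p><a href=\"/one\">1</a> <A href=\"/two\">2</A> <abbr>x</abbr></p>";
        System.out.println(countAnchors(sample)); // prints 2
    }
}
```

If the raw count is far higher than the number of elements jsoup returned, the HTML reaching the parser was probably incomplete. If the page was originally fetched with Jsoup.connect(...).get(), one possible explanation (an assumption, not confirmed here) is jsoup's default cap on the response body size, which Connection.maxBodySize(0) lifts.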

Answered 2013-09-03T08:10:16.613

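I tested jsoup with generated HTML files containing more than 50,000 (fifty thousand) anchor tags.

Jsoup parsed these files completely and was able to select all anchor elements and their href attributes correctly...

So IMHO this is not a fundamental jsoup problem.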
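A test along these lines could be reproduced roughly as follows. This is a hypothetical sketch, not the code actually used for the test above: the class name AnchorPageGenerator, the href pattern, and the link text are all made up for illustration.

```java
// Hypothetical generator for a large test page; names and link format are illustrative only.
final class AnchorPageGenerator {
    // Builds a minimal HTML page containing exactly `count` anchor tags,
    // each with a distinct href so missing entries are easy to spot.
    static String generate(int count) {
        StringBuilder sb = new StringBuilder();
        sb.append("<!DOCTYPE html>\n<html><head><title>anchors</title></head><body>\n");
        for (int i = 0; i < count; i++) {
            sb.append("<a href=\"/page/").append(i).append("\">link ")
              .append(i).append("</a>\n");
        }
        sb.append("</body></html>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = generate(50_000);
        System.out.println(html.length() + " characters generated");
    }
}
```

Passing the generated string to Jsoup.parse and checking that select("a[href]").size() equals the requested count would then show whether the parser itself drops anything.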

Answered 2014-09-27T06:42:20.633