java - Jsoup finds only half of the ~9000 tags in a document

Question

score 3 · Accepted Answer

After further inspection, I found that my web page loads in two increments, which I imagine is because of the sheer volume of data. The last entry in my Jsoup array of <a> tags corresponded to the last <a> in the first increment of the page.

I got around this by pulling the HTML down separately with this method:

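import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Downloads the whole response body as one string, independent of Jsoup's own fetching.
private static String getHtml(String location) throws IOException {
    URL url = new URL(location);
    URLConnection conn = url.openConnection();
    StringBuilder builder = new StringBuilder();
    // try-with-resources closes the reader (and the underlying stream) when done.
    // UTF-8 is assumed here; use the page's actual charset if it differs.
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
        String input;
        while ((input = in.readLine()) != null) {
            // readLine() strips the line terminator, so put one back.
            builder.append(input).append('\n');
        }
    }
    return builder.toString();
}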

I then call the Jsoup.parse method on the resulting string. This gave me all the data, and it actually improved performance (although for the life of me I don't know why).

score 1 · Accepted Answer

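I tested jsoup with generated HTML files containing more than 50,000 (fifty thousand) anchor tags.

Jsoup parsed these files completely and was able to select all anchor elements and href attributes correctly...

So IMHO this is not a basic jsoup problem.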
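
For anyone who wants to repeat that kind of test, here is a minimal sketch of how such a file could be generated using only the standard library. The class name `AnchorFixture`, its methods, the temp-file prefix, and the `example.com` URLs are all made up for illustration; the answer does not say how its files were produced.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for building a large test page; none of these names come from the answer.
class AnchorFixture {

    // Builds an HTML document containing exactly `count` anchor tags, one per line.
    static String generate(int count) {
        StringBuilder sb = new StringBuilder();
        sb.append("<!DOCTYPE html>\n<html><head><title>anchors</title></head><body>\n");
        for (int i = 0; i < count; i++) {
            // example.com is a placeholder host, used only to give each link a distinct href.
            sb.append("<a href=\"http://example.com/page/").append(i)
              .append("\">link ").append(i).append("</a>\n");
        }
        sb.append("</body></html>\n");
        return sb.toString();
    }

    // Counts non-overlapping occurrences of `needle`, as a parser-independent sanity check.
    static int countOccurrences(String haystack, String needle) {
        if (needle.isEmpty()) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        String html = generate(50_000);
        // Write the page to a temporary file so it can also be opened or parsed from disk.
        Path file = Files.createTempFile("anchors", ".html");
        Files.write(file, html.getBytes(StandardCharsets.UTF_8));
        System.out.println(countOccurrences(html, "<a href=") + " anchors written to " + file);
    }
}
```

With jsoup on the classpath, passing the generated string to Jsoup.parse and calling select("a[href]").size() on the resulting document should report the same count if the parser handles the whole file. A lower number from a live fetch would then point at the download step, not the parser.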
