2 Answers
After further inspection, I found that my web page loads itself in two increments. I imagine it is because of the sheer volume of data. The last entry in my jSoup array of <a> tags corresponded to the last <a> on the first increment of the page.
I got around this by pulling the HTML down separately with this method:
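import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

private static String getHtml(String location) throws IOException {
    URL url = new URL(location);
    URLConnection conn = url.openConnection();
    StringBuilder builder = new StringBuilder();
    // Assumes the page is UTF-8 encoded; try-with-resources closes the stream.
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
        String input;
        while ((input = in.readLine()) != null) {
            // readLine() strips the line terminator, so put it back to keep lines from running together.
            builder.append(input).append('\n');
        }
    }
    return builder.toString();
}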
And then calling the Jsoup.parse method on the resulting string. This meant I had all the data, and it actually improved performance (although for the life of me I don't know how).
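I tested jsoup with generated HTML files containing more than 50,000 (fifty thousand) anchor tags.
Jsoup parsed these files completely and was able to select all anchor elements and href attributes correctly...
So IMHO, this is not a fundamental jsoup problem.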
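The answer does not show how its test files were produced, so the following is only a sketch of one way to generate a comparable page. The class name AnchorPageGenerator, its two methods, and the /item/N link pattern are hypothetical and chosen purely for illustration. The regex-based counter is a rough sanity check on the generated text, not a substitute for an HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorPageGenerator {

    // Builds an HTML document containing `count` anchor tags (hypothetical helper).
    static String generate(int count) {
        StringBuilder html = new StringBuilder("<!DOCTYPE html>\n<html><body>\n");
        for (int i = 0; i < count; i++) {
            html.append("<a href=\"/item/").append(i).append("\">Item ")
                .append(i).append("</a>\n");
        }
        html.append("</body></html>\n");
        return html.toString();
    }

    // Counts opening <a ...> tags with a simple regex; a sanity check only.
    static int countAnchors(String html) {
        Matcher matcher = Pattern.compile("<a\\s").matcher(html);
        int found = 0;
        while (matcher.find()) {
            found++;
        }
        return found;
    }

    public static void main(String[] args) {
        String html = generate(50_000);
        System.out.println(countAnchors(html) + " anchors, " + html.length() + " characters");
    }
}
```

To repeat the experiment, the generated string could be passed to Jsoup.parse and the size of the result of select("a[href]") compared with the count above. If the two match for a locally generated string but not for a page fetched over the network, that would point at the download step rather than the parser.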