java - 在 Java 中翻录 html 页面源代码问题

Question

我正在尝试翻录网站的 html 页面源以获取电子邮件。当我运行 ripper/dumper 或任何你想调用它的东西时，它会获取所有源代码但在第 160 行停止但我可以手动转到网页>右键单击>单击查看页面源然后解析文本。整个源代码只有 200 多行。手动转到每个页面并右键单击的唯一问题是有超过 100k 个页面并且需要一段时间。

这是我用来获取页面源的代码：

    public static void main(String[] args) throws IOException, InterruptedException {

    URL url = new URL("http://www.runelocus.com/forums/member.php?102786-wapetdxzdk&tab=aboutme#aboutme");
    URLConnection connection = url.openConnection();

    connection.setDoInput(true);
    InputStream inStream = connection.getInputStream();
    BufferedReader input = new BufferedReader(new InputStreamReader(
            inStream));

    String html = "";
    String line = "";
    while ((line = input.readLine()) != null)
        html += line;
    System.out.println(html);
    }

score 1 · Accepted Answer

如果您尝试抓取 HTML 页面的内容，则不应使用这样的原始连接。使用现有库：HTML Unit是一个非常常用的库。

你传入 URL，它给你一个代表页面的对象，你得到所有的 HTML 标记作为对象（例如，你得到元素的 Div 对象，元素的 HTMLAnchor 对象等）。使用 HTML Unit 等现有框架并读取页面内容将使您的生活变得更加轻松。

您还可以进行搜索（例如 elementById、elementByTagName、按属性等），这使得在给定页面标记的情况下更容易在文档中跳转。

您还可以根据需要模拟点击等。

score 0 · Accepted Answer

Upon looking at this, my best guess is that your while loop conditional is bad. I'm unfamiliar with the syntax you're using. Mind you, I have not used Java in awhile. But I feel like it should read...

String line = input.readLine();
while(line != null)
{
    html += line; //should use a StringBuilder here for optimization
    line = input.readLine();
}

I do note the StringBuilder optimization. Also, I think this would be easier using the Scanner class.

score 0 · Accepted Answer

当您打开具有不同字符集的 InputStreamReader 时可能会有所帮助？查看您提到的页面，字符集是 ISO-8859-1：

BufferedReader input = 
    new BufferedReader(new InputStreamReader(inStream, "ISO-8859-1"));

score 0 · Accepted Answer

我运行了你的代码，它似乎得到了所有的 HTML，包括 HTML 结束标记。

您是否想过您可能必须登录网站才能查看更多信息？在这种情况下，用户 tsOverflow 建议的库可能会有所帮助。

java - 在 Java 中翻录 html 页面源代码问题

4 回答 4

Related

Reference