java - How do I extract all the text from a web page?

Question

I'm using the JSoup library to extract text from a web page. Here is my code:

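 Document doc;

 try {
     URL url = new URL(text);

     // Fetch and parse the page; the second argument is the timeout in milliseconds
     doc = Jsoup.parse(url, 70000);

     // Select every <p> element and append its text to the text field
     Elements paragraphs = doc.select("p");
     for (Element p : paragraphs) {
         textField.append(p.text());
         textField.append("\n");
     }
 } catch (Exception ex) {
     ex.printStackTrace();
 }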

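Here I can only get the text from the "p" tags, but I need all the text on the page. How can I do that? It probably involves traversing the nodes, but I've only just started using JSoup.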

score 1 · Accepted Answer

Try this:

String text = Jsoup.parse(new URL("https://www.google.com"), 10000).text();
System.out.println(text);

Here, 10000 is the timeout in milliseconds.

score 0 · Accepted Answer

You might want to use Boilerpipe instead, since you don't need HTML parsing, only text extraction. It should be faster and use less CPU.

Example:

URL url = new URL("http://www.example.com/some-location/index.html");
// NOTE: Use ArticleExtractor unless DefaultExtractor gives better results for you
String text = ArticleExtractor.INSTANCE.getText(url);

Taken from: https://code.google.com/p/boilerpipe/wiki/QuickStart

score 0 · Accepted Answer

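Maybe a completely different approach. I'm not sure what you are doing, so I don't know exactly what you need. But you could fetch the raw source of the whole web page and then use a regex to remove all the HTML tags. I once did something similar for a text-to-code ratio tool (although in PHP).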
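
For illustration, here is a rough sketch of what the regex idea could look like in Java, using only the standard library. The class name TagStripper, the method stripTags, and the sample HTML are made up for this example, and it assumes you already have the page source as a String. It drops script and style blocks, comments, and any remaining tags, then decodes only a handful of common entities. Regexes can be fooled by unusual or malformed HTML, so treat this as a starting point rather than a robust extractor.

```java
import java.util.regex.Pattern;

public class TagStripper {
    // <script> and <style> elements, including their contents (hypothetical helper patterns)
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // HTML comments
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // Any remaining tag
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    // Runs of whitespace, collapsed to a single space at the end
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String stripTags(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a few common entities only; "&amp;" goes last so that
        // "&amp;lt;" becomes "&lt;" and not "<".
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch
        String html = "<html><head><title>Demo</title><style>p{color:red}</style></head>"
                + "<body><h1>Heading</h1><p>First &amp; second</p>"
                + "<script>var x = 1;</script></body></html>";
        System.out.println(stripTags(html));
    }
}
```

Replacing each tag with a space instead of an empty string keeps words from adjacent elements from running together, and the final whitespace pass cleans up the extra spaces.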
