java - Extracting core web page text with JSoup

Question

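I'm new to JSoup, so sorry if my question is too trivial. I am trying to extract article text from http://www.nytimes.com/, but when I print the parsed document I can't see any articles in the output.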

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class App
{
    public static void main( String[] args )
    {
        String url = "http://www.nytimes.com/";
        Document document;
        try {
            // Fetch the page and parse it into a DOM
            document = Jsoup.connect(url).get();

            System.out.println(document.html()); // Articles not getting printed
            //System.out.println(document.toString()); // Same here
            String title = document.title();
            System.out.println("title : " + title); // Title is fine
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

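OK, I also tried parsing "http://en.wikipedia.org/wiki/Big_data" to retrieve the wiki data, and I have the same problem there: the wiki data does not show up in the output. Any help or hints would be much appreciated.

Thanks.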

score 0 · Accepted Answer

Here is how to get the text of all <p class="summary"> elements:

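// Needs org.jsoup.Jsoup, org.jsoup.nodes.Document and org.jsoup.nodes.Element.
// get() throws IOException, so run this where that exception is handled or declared.
final String url = "http://www.nytimes.com/";
Document doc = Jsoup.connect(url).get();

// "p.summary" matches every <p> element that has the class "summary"
for( Element element : doc.select("p.summary") )
{
    if( element.hasText() ) // Skip those tags without text
    {
        System.out.println(element.text());
    }
}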

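If you need all <p> tags without any filtering, you can use doc.select("p") instead. In most cases, though, it is better to select only the ones you actually need (see the Jsoup selector documentation).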
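
To make the difference between the two selectors concrete, here is a minimal sketch that runs against a small inline HTML string rather than the live page. The class name SelectorSketch, the helper textsOf, and the sample markup are all made up for illustration; only Jsoup.parse, select, hasText and text come from Jsoup itself.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectorSketch {

    // Hypothetical helper: collects the text of every element matching
    // the CSS query, skipping elements that contain no text.
    static List<String> textsOf(Document doc, String cssQuery) {
        List<String> texts = new ArrayList<>();
        for (Element element : doc.select(cssQuery)) {
            if (element.hasText()) {
                texts.add(element.text());
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // Made-up markup standing in for a real page
        String html = "<html><body>"
                + "<p class=\"summary\">First summary</p>"
                + "<p>Plain paragraph</p>"
                + "<p class=\"summary\"></p>"
                + "</body></html>";
        Document doc = Jsoup.parse(html);

        System.out.println(textsOf(doc, "p.summary")); // [First summary]
        System.out.println(textsOf(doc, "p"));         // [First summary, Plain paragraph]
    }
}
```

The narrower query returns only the summary paragraph, while "p" also picks up unrelated paragraphs, which is why selecting only what you need usually gives cleaner output.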
