java - Extracting certain text from an HTML file

Question

I want to extract the text placed between paragraph (p) and link (a href) tags in an HTML file. I want to do it without Java regex or an HTML parser. I thought:

while ((word = reader.readLine()) !=null) { //iterate to the end of the file
    if(word.contains("<p>")) { //catching p tag
        while(!word.contains("</p>")) { //iterate to the end of that tag
            try { //start writing
                out.write(word);
            } catch (IOException e) {
            }
        }
    }
}

But it does not work. The code seems valid to me. How can the reader catch the "p" and "a href" tags?

score 3 · Accepted Answer

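The problem starts when you have something like <p>blah</p> on a single line: the line contains both <p> and </p>, so the inner loop never writes anything, and for a line with only <p> it never ends, because word is never re-read. One simple solution is to change every < to \n< so that each tag starts its own chunk, something like this: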

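boolean insidePar = false;
String line;
while ((line = reader.readLine()) != null) {
    // Put every tag at the start of its own chunk; replace() is literal, so no regex in the replacement
    for (String word : line.replace("<", "\n<").split("\n")) {
        if (word.contains("<p>")) {
            insidePar = true;
        } else if (word.contains("</p>")) {
            insidePar = false;
        }
        if (insidePar) {
            out.write(word); // write the word (the chunk still starts with the "<p>" tag itself)
        }
    }
}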

That said, I would also recommend using a parser library, as @HovercraftFullOfEels suggested.

Edit: I have updated the code so it is closer to a working version, but there will probably be more problems along the way.
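For a version that also copes with paragraphs spanning several lines and with <a href=...> links, here is a minimal sketch that works on the whole file content as one String and uses only indexOf and substring, so no regex and no parser. The names TagTextExtractor, extract and collect are made up for this illustration. It assumes lowercase tag names, no '>' inside attribute values, and no nesting of the same tag; markup nested inside a paragraph is returned as-is. It lists all paragraph texts first, then all link texts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original answers.
class TagTextExtractor {

    // Returns the text inside <p>...</p> elements, followed by the text
    // inside <a ...href...>...</a> elements.
    static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        collect(html, "p", false, result);
        collect(html, "a", true, result);
        return result;
    }

    private static void collect(String html, String name, boolean needsHref, List<String> out) {
        String open = "<" + name;
        String close = "</" + name + ">";
        int pos = 0;
        while (true) {
            int start = html.indexOf(open, pos);
            if (start < 0) return;                      // no more opening tags
            int afterName = start + open.length();
            int tagEnd = html.indexOf('>', afterName);  // end of the opening tag
            if (tagEnd < 0) return;                     // malformed: tag never closes
            pos = tagEnd + 1;

            // Reject longer tag names such as <pre> or <abbr>.
            char next = html.charAt(afterName);
            boolean exactTag = next == '>' || Character.isWhitespace(next);
            String attrs = html.substring(afterName, tagEnd);
            if (!exactTag || (needsHref && !attrs.contains("href"))) continue;

            int end = html.indexOf(close, pos);
            if (end < 0) return;                        // unclosed element: stop
            out.add(html.substring(pos, end).trim());
            pos = end + close.length();
        }
    }

    public static void main(String[] args) {
        String html = "<p>blah</p>\n<p>first\nsecond</p>\n"
                + "<a href=\"http://jsoup.org/\">jsoup</a>";
        for (String text : extract(html)) {
            System.out.println(text);
        }
    }
}
```

To run it on a file, read the file into a String first (for example with Files.readString on Java 11 or later) and pass that to extract. Real-world HTML breaks these assumptions quickly, which is why a parser library is still the safer choice.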

score 0 · Accepted Answer

I think it would be easier to use a library for this. Use this one: http://jsoup.org/. It can also parse a String.

answered 2013-05-18T14:07:54.007
