Hello,

I want to extract the text between the div tags:

<div class="innercontenttxt"> 
<p><img border="1" align="left" height="170" width="324" vspace="3" hspace="2" src="/tmdbuserfiles/ramdev-balakrishna(1).jpg" alt="ramdev aide remanded, lakrishna acharya judicial remand, ramdev aide fake passport case, baba ramdev assistant judicial custody, balakrishna sent to judicial custody, yoga guru ramdev assistant remanded, yoga guru ramdev assistant balakrishna" />
Yoga guru Ramdev's aide Balakrishna Acharya remanded to 14 days judicial custody in a fake passport on Saturday. He was arrested yesterday after he failed to appear at a Dehradun court.
    <br />
    <br />
     Balakrishna Acharya, who is basically a Nepalese citizen, 
     is alleged to have submitted fake documents to procure a passport. 
     When he failed to appear in Dehradun court in connection with the case,
</p>  
</div>

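The result after extraction should be:

Ramdev's aide Balakrishna Acharya remanded to 14 days judicial custody in a fake passport case on Saturday. He was arrested yesterday after he failed to appear at a Dehradun court. Balakrishna Acharya, who is basically a Nepalese citizen, is alleged to have submitted fake documents to procure a passport. When he failed to appear in court in connection with the case, the court issued a non-bailable warrant, and he was subsequently arrested yesterday.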

2 Answers

You might want to try one of the Java HTML parser libraries:

HTML Parser - http://htmlparser.sourceforge.net

jsoup - http://jsoup.org/

answered 2012-07-26T04:52:56.433

This question seems to be similar to other questions.

Assuming you have already stored the HTML source in a String variable named htmlPage:

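// Find the first opening <div ...> tag, then the '>' that ends it
int divIndex = htmlPage.indexOf("<div");
divIndex = htmlPage.indexOf(">", divIndex);

// Take everything up to the first closing </div> (nested divs are not handled)
int endDivIndex = htmlPage.indexOf("</div>", divIndex);
String content = htmlPage.substring(divIndex + 1, endDivIndex);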
answered 2012-07-26T14:59:55.117
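
The indexOf approach above returns the raw inner HTML of the div, so the result still contains the <p>, <img> and <br /> tags, while the question asks for plain text. The following is a minimal sketch of one way to finish the job using only the standard library: it guards against a missing div, strips the tags with a regular expression, and collapses whitespace. The class name DivTextExtractor, the method extractDivText and the constant TAG are hypothetical names chosen for illustration. It assumes no nested divs and no '>' characters inside attribute values, and it does not decode HTML entities; for real pages a parser library such as the ones linked in the first answer is likely the more robust choice.

```java
import java.util.regex.Pattern;

public class DivTextExtractor {

    // Matches any tag such as <p>, </p>, <br /> or <img ... />.
    // Assumption: attribute values never contain a '>' character.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /**
     * Returns the text of the first <div>...</div> in html with tags removed,
     * or null if no complete div is found. Nested divs are not handled.
     */
    public static String extractDivText(String html) {
        int open = html.indexOf("<div");
        if (open < 0) {
            return null;
        }
        int openEnd = html.indexOf('>', open);
        if (openEnd < 0) {
            return null;
        }
        int close = html.indexOf("</div>", openEnd);
        if (close < 0) {
            return null;
        }

        String inner = html.substring(openEnd + 1, close);
        // Replace each tag with a space so words on either side stay separated.
        String noTags = TAG.matcher(inner).replaceAll(" ");
        // Collapse runs of whitespace (including newlines) into single spaces.
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String htmlPage = "<div class=\"innercontenttxt\"> <p><img src=\"/a.jpg\" alt=\"x\" />"
                + "First sentence.<br />\n   <br />\n   Second sentence,\n</p>  </div>";
        System.out.println(extractDivText(htmlPage));
    }
}
```

With the shortened sample string in main, this prints "First sentence. Second sentence," on one line. Replacing tags with a space rather than an empty string is a deliberate choice here, so that text separated only by <br /> does not run together.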