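I am writing this parser for HTML files that I have scraped. The parser is supposed to extract the thread titles, user posts, and total views. I managed to get the HTML tags, but the problem is that it does not retrieve all of the thread titles and only returns some of them.
HTML code (sorry for the poor alignment, I copied it from the site's source):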
<tbody id="threadbits_forum_2">
<tr>
<td class="alt1" id="td_threadstatusicon_3396832">
<img src="http://www.hardwarezone.com.sg/img/forums/hwz/statusicon/thread_hot.gif" id="thread_statusicon_3396832" alt="" border="" />
</td>
<td class="alt2"> </td>
<td class="alt1" id="td_threadtitle_3396832" title="Updated on 3 October 2011
Please check Price Guides for latest prices
A PC Buyer’s Guide that is everything to everyone is simply not possible. This is a simple guide to putting together a PC with a local flavour. Be sure to read PC Buyer’s Guide from other media.
If you have any...">
<div>
<span style="float:right">
<img class="inlineimg" src="http://www.hardwarezone.com.sg/img/forums/hwz/misc/sticky.gif" alt="Sticky Thread" />
</span>
<font color=red><b>Sticky: </b></font>
<a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&t=3396832" id="thread_title_3396832">Buyer's Guide II: Extreme, High-End, Mid-Range, Budget, and Entry Level Systems - Part 2</a>
<span class="smallfont" style="white-space:nowrap">(<img class="inlineimg" src="http://www.hardwarezone.com.sg/img/forums/hwz/misc/multipage.gif" alt="Multi-page thread" border="0" /> <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&t=3396832">1</a> <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&t=3396832&page=2">2</a> <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&t=3396832&page=3">3</a> <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&t=3396832&page=4">4</a> <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&t=3396832&page=5">5</a> ... <a href="showthread.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&t=3396832&page=17">Last Page</a>)</span>
</div>
<div class="smallfont">
<span style="cursor:pointer" onclick="window.open('member.php?s=2a7d1dc5bbc6bf85468a79ec2e6eb86e&u=39963', '_self')">adrianlee</span>
</div>
My code so far:
try(BufferedReader br = new BufferedReader(new FileReader(pageThread)))
{
String html = "";
while(br.readLine() != null)
{
html += br.readLine() + "\n";
}
Document doc = Jsoup.parse(html);
//To get the thread list
Elements threadsList = doc.select("tbody[id^=threadbits_forum]").select("tr");
for(Element e: threadsList)
{
//To get the title
System.out.println("Title: " + e.select("a[id^=thread_title]").text());
}
System.exit(0);
}catch(Exception e)
{
e.printStackTrace();
}
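Results:
- Title:
- Title: Want to be part of the HardwareZone editorial team?
- Title:
- Title: pa9797 back to PC wa new Rig !
- Title: [EPIC] Another first from Andyson, a Platinum modular PSU
- Title:
- Title: Which shop in SLS is good for buying a new CPU? . . . soon
Do you have a solution for this problem?
Thanks.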
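
One possible cause, offered as a sketch rather than a confirmed diagnosis: the loop above calls `br.readLine()` twice per iteration, once in the `while` condition and once in the body, so the line read by the condition is thrown away and only every other line reaches `html`. Any `<a id="thread_title_...">` that sits on a discarded line would then be missing from the parsed document. The class below uses only the standard library. `ReadAllLines`, `readAll`, and `readSkipping` are hypothetical names introduced for illustration, and the sample markup is a made-up miniature of the forum HTML.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class ReadAllLines {
    // Reads each line exactly once: the line fetched in the loop condition
    // is the same one that gets appended in the body.
    static String readAll(BufferedReader br) throws IOException {
        StringBuilder html = new StringBuilder();
        String line;
        while ((line = br.readLine()) != null) {
            html.append(line).append('\n');
        }
        return html.toString();
    }

    // Same loop shape as in the question, kept only for comparison:
    // the line consumed by the condition is never appended.
    static String readSkipping(BufferedReader br) throws IOException {
        String html = "";
        while (br.readLine() != null) {
            html += br.readLine() + "\n";
        }
        return html;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample: two rows, each with its title link on its own line.
        String sample = "<tr>\n<a id=\"thread_title_1\">First</a>\n</tr>\n"
                      + "<tr>\n<a id=\"thread_title_2\">Second</a>\n</tr>";

        String full = readAll(new BufferedReader(new StringReader(sample)));
        String partial = readSkipping(new BufferedReader(new StringReader(sample)));

        System.out.println("readAll keeps second title: "
                + full.contains("thread_title_2"));
        System.out.println("readSkipping keeps second title: "
                + partial.contains("thread_title_2"));
    }
}
```

The string returned by `readAll` can be passed to `Jsoup.parse(html)` exactly as in the code above. jsoup also has a `Jsoup.parse(File, String)` overload that reads the file itself, which would avoid the manual loop altogether. If some rows in the `tbody` legitimately contain no title link, selecting `a[id^=thread_title]` directly instead of iterating over every `tr` may also remove the remaining empty lines, though that is an assumption about the rest of the page.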