
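So here is the scenario. I have a large HTML file that I want to scrape with JSoup. I'm new to this, and I have been reading through some tutorials and the API reference. I have the following block of HTML.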

<p><a name="bob"></a>
<table class='schedules'>
<tr><td  align='center' colspan="5"><b>Bob the Builder</b><br>
<a href="blah blah" class='tiny'>Blah Blah Blah</a></td></tr>
<tr><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='bm'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='bm'><a href="random/randomUrl.htm">Blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr><td class='bm'><a href="random/randomUrl.htm">Blah</a></td><!--<td class='whoohaa'><a href="random/randomUrl.htm">Blah</a></td>--><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='cc'><a href="random/randomUrl.htm">blah</a></td><td class='cc'><a href="random/randomUrl.htm">Blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">Blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr><td class='sk'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td></tr>
</table>
</p>

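There are more blocks like this one, all following the same pattern; only the name attribute on the first line changes (from "bob" to something else). What I want to do is first select the "bob" p block, and then retrieve all of the HTML up to the closing p tag on the last line.

I tried the following: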

Elements innerStuff = doc.select("a:contains(bob) ~ *");

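But that only gave me the links that have an href attribute, which I guess is the expected result. I'm struggling to see how else I could approach this.

Any help with this would be much appreciated.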


1 Answer


A stricter way to select the a tag by its name attribute is:

doc.select("a[name=bob]")
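
To illustrate the difference between the two selectors: `:contains(...)` matches on an element's text, while `[name=bob]` matches on the attribute value. Below is a minimal sketch, assuming jsoup is on the classpath; the two-link snippet, the class name `SelectorComparison`, and the helper `firstMatch` are made up for illustration and are not from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class SelectorComparison {
    // Hypothetical snippet: an empty named anchor, then a link whose text mentions Bob.
    static final String HTML =
        "<p><a name=\"bob\"></a><a href=\"x.htm\">Bob's page</a></p>";

    // Returns the outer HTML of the first element matching the selector, or null.
    static String firstMatch(String html, String css) {
        Element first = Jsoup.parse(html).select(css).first();
        return first == null ? null : first.outerHtml();
    }

    public static void main(String[] args) {
        // Matches on link text (case-insensitively), so it finds the href link.
        System.out.println(firstMatch(HTML, "a:contains(bob)"));
        // Matches on the name attribute, so it finds the empty anchor.
        System.out.println(firstMatch(HTML, "a[name=bob]"));
    }
}
```

The named anchor has no text of its own, so a text-based selector cannot find it; the attribute selector can.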

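From there, you should be able to navigate to the element you want with parent() (to get the p tag that contains the link). For example (you need to call first() beforehand to get the first, and only, element matching the selector):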

doc.select("a[name=bob]").first().parent()

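But there is a catch: the parsed HTML document differs from the original HTML. A table is not allowed inside a paragraph, so the parser closes the open p when it reaches the table, and the leftover closing p tag ends up as an empty p. This is the structure of the original HTML: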

p
    a[name=bob]
    table
        ...

The parsed HTML looks like this:

p
    a[name=bob]
table
    ...
p

So, starting from the link tag, to get the table element you need to go up one level (to the paragraph) and take the next element:

doc.select("a[name=bob]").first().parent().nextElementSibling()
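
Putting the steps together, a null-safe version might look like the sketch below. It assumes jsoup is on the classpath; the sample markup is a shortened, hypothetical stand-in for the question's HTML (the "wendy" block is invented), and `ScheduleBlocks`, `findTable`, and `tableHtml` are illustrative names. Because the resulting tree can depend on the parser version and on whether the document has a doctype, the sketch checks both places the table might end up rather than relying on only one layout.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ScheduleBlocks {
    // Hypothetical, shortened stand-in for the markup in the question.
    static final String SAMPLE =
        "<p><a name=\"bob\"></a>\n"
      + "<table class='schedules'>\n"
      + "<tr><td colspan=\"2\"><b>Bob the Builder</b></td></tr>\n"
      + "<tr><td class='bk'><a href=\"random/randomUrl.htm\">Blah</a></td>"
      + "<td class='sk'><a href=\"random/randomUrl.htm\">blah</a></td></tr>\n"
      + "</table>\n</p>\n"
      + "<p><a name=\"wendy\"></a>\n"
      + "<table class='schedules'>\n"
      + "<tr><td colspan=\"2\"><b>Wendy</b></td></tr>\n"
      + "</table>\n</p>";

    // Finds the table that belongs to the named anchor, or returns null.
    static Element findTable(Document doc, String anchorName) {
        Element anchor = doc.select("a[name=" + anchorName + "]").first();
        if (anchor == null) {
            return null; // no such anchor in the document
        }
        // Layout 1: the table is still next to the anchor, inside the same p.
        Element sibling = anchor.nextElementSibling();
        if (sibling != null && sibling.tagName().equals("table")) {
            return sibling;
        }
        // Layout 2: the parser closed the p early, so the table follows the p.
        Element parent = anchor.parent();
        Element after = (parent == null) ? null : parent.nextElementSibling();
        if (after != null && after.tagName().equals("table")) {
            return after;
        }
        return null;
    }

    // Convenience wrapper: parse the HTML and return the table markup, or null.
    static String tableHtml(String html, String anchorName) {
        Element table = findTable(Jsoup.parse(html), anchorName);
        return table == null ? null : table.outerHtml();
    }

    public static void main(String[] args) {
        System.out.println(tableHtml(SAMPLE, "bob"));
    }
}
```

Calling `outerHtml()` on the returned element gives the table's markup, which is the content the question wants to retrieve for each block.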
answered 2012-11-11T12:24:57.513