python - Getting specific data with BeautifulSoup

Question

I used page.prettify() to tidy up the HTML, and the text I now want to extract looks like this:

        <div class="item">
         <b>
          name
         </b>
         <br/>
         stuff here
        </div>

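My goal is to extract stuff here from that, but I'm confused because it isn't wrapped in any tag other than the div, which already contains other things. The extra whitespace at the start of each line makes it harder too.

What is the way to do this?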

score 2 · Accepted Answer

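A combination of find and next_sibling (spelled nextSibling in BeautifulSoup 3) works for the example you posted.

from bs4 import BeautifulSoup
soup = BeautifulSoup(""" <div class="item"> <b> name </b>  <br/>  stuff here </div>""", "html.parser")
soup.find("div", "item").find('br').next_sibling  # the text node right after <br/>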
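The value found this way still carries the surrounding whitespace the question mentions. A minimal sketch of stripping it, assuming bs4 with the built-in html.parser; the helper name text_after_br is hypothetical and not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

# Markup as it looks after prettify(), indentation included.
HTML = """
<div class="item">
 <b>
  name
 </b>
 <br/>
 stuff here
</div>
"""

def text_after_br(html):
    """Return the stripped text that follows the first <br/> in div.item."""
    soup = BeautifulSoup(html, "html.parser")
    br = soup.find("div", "item").find("br")
    sibling = br.next_sibling  # node right after <br/>, or None if there is none
    # NavigableString subclasses str; a Tag sibling (or None) yields None here.
    return sibling.strip() if isinstance(sibling, str) else None

result = text_after_br(HTML)
print(result)
```

The isinstance check guards against the case where the node after the br is another tag rather than bare text.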

score 1 · Accepted Answer

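If you are really sure that you want the content that starts after a specific tag and ends before the last closing tag, you can use a regular expression from that point on. It is not the most elegant approach, but if your requirements are that specific, it may work.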
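A rough sketch of the regular-expression route, assuming the fragment always has the shape shown in the question (a br followed by text, then the closing div). The pattern is only an illustration; regexes tend to be brittle on real-world HTML:

```python
import re

html = """<div class="item"> <b> name </b>  <br/>  stuff here </div>"""

# Capture everything between <br/> (also <br> or <br />) and the next </div>.
pattern = re.compile(r"<br\s*/?>(.*?)</div>", re.DOTALL | re.IGNORECASE)

match = pattern.search(html)
extracted = match.group(1).strip() if match else None
print(extracted)
```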

score 0 · Accepted Answer

You can use the div element's .contents attribute to get all of its direct children, and then pick out the string.

Edit:

This is the approach I was hinting at:

from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("""<div class='item'> <b> name </b>  <br/>  stuff here </div>""", "html.parser")
div = soup.find('div')
# Keep only the text nodes that are direct children of the div, stripped of whitespace.
print(''.join(el.strip() for el in div.contents if isinstance(el, NavigableString)))
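The same .contents idea can be applied when a page holds several such blocks. A small sketch, assuming bs4 with html.parser; the second item in the sample markup and the helper name direct_text are made up for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Two items; the second one is invented sample data.
HTML = """
<div class="item"><b>name</b><br/> stuff here </div>
<div class="item"><b>other</b><br/> more stuff </div>
"""

def direct_text(tag):
    """Join the stripped text nodes that are direct children of tag."""
    return "".join(
        child.strip()
        for child in tag.contents
        if isinstance(child, NavigableString)
    )

soup = BeautifulSoup(HTML, "html.parser")
texts = [direct_text(div) for div in soup.find_all("div", class_="item")]
print(texts)
```

Text nested inside the b tag is skipped because it is not a direct child of the div.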
