python - BeautifulSoup: extracting values from child nodes

Question

I have the following HTML:

<td class="section">
    <div style="margin-top:2px; margin-bottom:-10px; ">
    <span class="username"><a href="user.php?id=xx">xxUsername</a></span>
    </div>
    <br>
<span class="comment">
A test comment
</span>
</td>

I only want to retrieve xxUsername and the comment text inside the SPAN tags. So far I have done this:

results = soup.findAll("td", {"class" : "section"})

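It does fetch every HTML block that matches the pattern shown above. Now I want to retrieve all the child values in a loop. Is that possible? If not, how do I get the child node information?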

score 7 · Accepted Answer

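You could try something like this. It basically does what you did above: it first iterates over every td with the class section, then over the text of every span inside it. It also prints the class, in case you need to be more restrictive: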

In [1]: from bs4 import BeautifulSoup

In [2]: html = "..."  # Your HTML here

In [3]: soup = BeautifulSoup(html, 'html.parser')

In [4]: for td in soup.find_all('td', {'class': 'section'}):
   ...:     for span in td.find_all('span'):
   ...:         print(span.attrs['class'], span.text)
   ...:         
['username'] xxUsername
['comment'] 
A test comment

Or, as a one-liner that is more complicated than necessary, storing everything back in your list:

In [5]: results = [span.text for td in soup.find_all('td', {'class': 'section'}) for span in td.find_all('span')]

In [6]: results
Out[6]: ['xxUsername', '\nA test comment\n']
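The comment text above keeps its surrounding newlines. If that is not wanted, one option is get_text(strip=True) instead of .text. This is a minimal sketch, assuming Python 3 and the built-in html.parser backend; the HTML is the snippet from the question, inlined so the example runs on its own:

```python
from bs4 import BeautifulSoup

# The HTML from the question, inlined so the sketch is self-contained.
html = """
<td class="section">
    <div style="margin-top:2px; margin-bottom:-10px; ">
    <span class="username"><a href="user.php?id=xx">xxUsername</a></span>
    </div>
    <br>
<span class="comment">
A test comment
</span>
</td>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) trims leading and trailing whitespace from each text piece.
texts = [
    span.get_text(strip=True)
    for td in soup.find_all("td", class_="section")
    for span in td.find_all("span")
]
print(texts)  # ['xxUsername', 'A test comment']
```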

Or, on the same theme, a dictionary whose keys are tuples of the classes and whose values are the texts themselves:

In [8]: results = dict((tuple(span.attrs['class']), span.text) for td in soup.find_all('td', {'class': 'section'}) for span in td.find_all('span'))

In [9]: results
Out[9]: {('username',): 'xxUsername', ('comment',): '\nA test comment\n'}

Assuming this one is closer to what you want, I would suggest rewriting it as:

In [10]: results = {}

In [11]: for td in soup.find_all('td', {'class': 'section'}):
   ....:     for span in td.find_all('span'):
   ....:         results[tuple(span.attrs['class'])] = span.text
   ....:         

In [12]: results
Out[12]: {('username',): 'xxUsername', ('comment',): '\nA test comment\n'}
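One caveat with a single dictionary keyed by class: if the page has several section cells, each new username and comment overwrites the previous one. A possible way around that is to build one record per td. The sketch below assumes Python 3 and html.parser; the two-cell HTML and the names alice and bob are made-up sample data, not taken from the question:

```python
from bs4 import BeautifulSoup

# Hypothetical sample with two "section" cells (invented for illustration).
html = """
<table><tr>
<td class="section">
    <div><span class="username"><a href="user.php?id=1">alice</a></span></div>
    <br>
    <span class="comment">First comment</span>
</td>
<td class="section">
    <div><span class="username"><a href="user.php?id=2">bob</a></span></div>
    <br>
    <span class="comment">Second comment</span>
</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
for td in soup.find_all("td", class_="section"):
    # find() returns None when the span is missing, so guard before reading text.
    username = td.find("span", class_="username")
    comment = td.find("span", class_="comment")
    records.append({
        "username": username.get_text(strip=True) if username else None,
        "comment": comment.get_text(strip=True) if comment else None,
    })

print(records)
```

Each entry in records then pairs a username with its own comment, which is probably closer to what the loop in the question was after.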

score 1 · Accepted Answer

To get the text from either the username or the comment <span> element:

from bs4 import BeautifulSoup

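# html holds the markup from the question
soup = BeautifulSoup(html, 'html.parser')
# Calling soup(...) is shorthand for soup.find_all(...); the list matches either class
for el in soup('span', ['username', 'comment']):
    print(el.string, end=' ')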

Output

xxUsername 
A test comment
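The same lookup can also be written with a CSS selector through select(), which some people find easier to read. This is only a sketch, assuming Python 3, html.parser, and a Beautiful Soup 4 install that includes its soupsieve dependency; the HTML is the snippet from the question:

```python
from bs4 import BeautifulSoup

# The HTML from the question, inlined so the sketch is self-contained.
html = """
<td class="section">
    <div style="margin-top:2px; margin-bottom:-10px; ">
    <span class="username"><a href="user.php?id=xx">xxUsername</a></span>
    </div>
    <br>
<span class="comment">
A test comment
</span>
</td>
"""

soup = BeautifulSoup(html, "html.parser")

# The comma-separated selector matches either span, restricted to td.section cells.
spans = soup.select("td.section span.username, td.section span.comment")
texts = [span.get_text(strip=True) for span in spans]
print(texts)  # ['xxUsername', 'A test comment']
```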
