beautifulsoup - Python BeautifulSoup: getting text non-recursively

Question

I have a span element with markup like the following. How can I extract only the text that sits outside the anchor (a) tags?

# print(soup.prettify())
<span class="1">
    text_wanted         
    <a data-toggle="notify" href="https://www.abc.com/1" class="class1"><span>text1</span></a>
    <a data-toggle="notify" href="https://www.abc.com/2" class="class2"><span>text2</span></a>
</span>

I am considering the following solution:

# Take the full text, then remove each anchor's text from it
text_all = soup.text
text_strip_list = [a.text.strip() for a in soup.find_all('a')]
for text_strip in text_strip_list:
    text_all = text_all.replace(text_strip, '').strip()

I would like to know whether there is a simpler way to get the wanted text without descending into the anchor tags.

Thanks in advance.

score 1 · Accepted Answer

Assuming html is a BeautifulSoup object holding the parsed HTML,

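# bs4 import; under the older BeautifulSoup 3 package this was
# "from BeautifulSoup import NavigableString"
from bs4 import NavigableString

# Keep only the direct children of the first span that are plain text nodes
print([node for node in html.find('span').contents if type(node) is NavigableString])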

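will yield the text nodes that are direct children of the outermost span. Whitespace-only nodes between the tags are included, so strip and filter the result as needed.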
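
For anyone who wants to try this end to end, here is a minimal sketch using the markup from the question. It assumes the bs4 package (BeautifulSoup 4) and the built-in "html.parser"; the names markup, outer, direct_text and same_text are made up for illustration. The second list uses find_all with recursive=False, which should give roughly the same result, though it would also pick up comment nodes if there were any.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup copied from the question
markup = """
<span class="1">
    text_wanted
    <a data-toggle="notify" href="https://www.abc.com/1" class="class1"><span>text1</span></a>
    <a data-toggle="notify" href="https://www.abc.com/2" class="class2"><span>text2</span></a>
</span>
"""

html = BeautifulSoup(markup, "html.parser")
outer = html.find("span")  # first span in the document, i.e. the outer one

# Look at direct children only, so text inside the <a> tags is never visited.
# Whitespace-only nodes between the tags are dropped after stripping.
direct_text = [
    node.strip()
    for node in outer.contents
    if type(node) is NavigableString and node.strip()
]

# Alternative: ask find_all for strings without recursing into child tags
same_text = [
    s.strip()
    for s in outer.find_all(string=True, recursive=False)
    if s.strip()
]

print(direct_text)
```

If the span holds several separate pieces of direct text, joining the list (for example with " ".join(direct_text)) is one way to get a single string.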
