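I'm working with HTML elements that contain child tags, and I want to "ignore" or strip those tags so that only the text remains. Right now, if I use .string on any element that contains a tag, all I get is None.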

import bs4

soup = bs4.BeautifulSoup("""
    <div id="main">
      <p>This is a paragraph.</p>
      <p>This is a paragraph <span class="test">with a tag</span>.</p>
      <p>This is another paragraph.</p>
    </div>
""")

main = soup.find(id='main')
for child in main.children:
    print(child.string)

Output:

This is a paragraph.
None
This is another paragraph.

I want the second line to be "This is a paragraph with a tag." How can I do that?

2 Answers

for child in soup.find(id='main'):
    if isinstance(child, bs4.Tag):
        print(child.text)

And you will get:

This is a paragraph.
This is a paragraph with a tag.
This is another paragraph.
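
A minimal sketch of the same idea, assuming Python 3 and the bundled "html.parser" backend. It selects the paragraphs with find_all("p") instead of filtering children with isinstance, and calls get_text(), the method form of .text. The names html and paragraph_texts are illustrative only, not from the original answer.

```python
import bs4

html = """
<div id="main">
  <p>This is a paragraph.</p>
  <p>This is a paragraph <span class="test">with a tag</span>.</p>
  <p>This is another paragraph.</p>
</div>
"""

# Naming the parser explicitly avoids the "no parser was explicitly
# specified" warning; "html.parser" ships with Python.
soup = bs4.BeautifulSoup(html, "html.parser")
main = soup.find(id="main")

# find_all("p") returns only Tag objects, so the whitespace-only strings
# between the paragraphs are skipped without an isinstance check.
paragraph_texts = [p.get_text() for p in main.find_all("p")]

for text in paragraph_texts:
    print(text)
```

Unlike iterating over the direct children, find_all("p") also matches paragraphs nested deeper inside the div, so this variant only fits when that is acceptable.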
Answered 2013-08-16T19:16:15.550

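Use the .strings iterable instead. Call ''.join() to pull in all the strings and concatenate them:

print(''.join(main.strings))

Iterating over .strings yields every string the element contains, whether it sits directly inside the element or inside a child tag.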

Demo:

>>> print(''.join(main.strings))

This is a paragraph. 
This is a paragraph with a tag. 
This is another paragraph. 
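
Joining main.strings returns the whole div as one string, including the whitespace between the paragraphs. If one string per paragraph is wanted, as in the question, the same join can be applied to each paragraph separately. This is a sketch assuming Python 3 and the "html.parser" backend; the variable name lines and the shortened markup are illustrative.

```python
import bs4

soup = bs4.BeautifulSoup(
    '<div id="main">'
    "<p>This is a paragraph.</p>"
    '<p>This is a paragraph <span class="test">with a tag</span>.</p>'
    "</div>",
    "html.parser",
)
main = soup.find(id="main")

# For each <p>, .strings yields its own text and the text of nested tags
# in document order; joining them rebuilds the full sentence.
lines = ["".join(p.strings) for p in main.find_all("p")]

for line in lines:
    print(line)
```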
Answered 2013-08-16T19:15:12.857