python - Extracting a tag's content when other tags are present

Question

I want to use Beautiful Soup to extract both the text inside a tag and the text outside it from the following HTML:

<p><i>Italic stuff</i> Not Italic stuff</p>

So I tried:

soup = BeautifulSoup('<p><i>Italic stuff</i> Not Italic stuff</p>')
ital = soup.i.string
notital = soup.string

But soup.string returns None instead of 'Not Italic stuff'. What am I doing wrong?

Thanks!

score 1 · Accepted Answer

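From the documentation of the .string attribute:

If this tag has a single string child, the return value is that string. If this tag has no children, or more than one child, the return value is None. If this tag has one child tag, the return value is the 'string' attribute of the child tag, recursively.

What you seem to need is the tail text of the i element, as shown in this answer: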
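To make the three cases in that description concrete, here is a minimal sketch. It assumes the bs4 package on Python 3 and the built-in "html.parser" backend, neither of which is stated in the question; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Case 1: a single string child, so .string is that string.
single = BeautifulSoup("<p>Only text</p>", "html.parser")
single_text = single.p.string

# Case 2: more than one child (an <i> tag plus a text node), so .string is None.
mixed = BeautifulSoup("<p><i>Italic stuff</i> Not Italic stuff</p>", "html.parser")
mixed_text = mixed.p.string

# Case 3: exactly one child tag, so .string recurses into that child.
nested = BeautifulSoup("<p><i>Italic stuff</i></p>", "html.parser")
nested_text = nested.p.string

print(repr(single_text), repr(mixed_text), repr(nested_text))
```

The markup from the question falls under case 2, which is why .string comes back as None there.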

In [12]: soup.i.findNextSibling(text=True)
Out[12]: u' Not Italic stuff'
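For anyone on bs4 and Python 3, roughly the same thing can be sketched as below. This assumes the bs4 package and the "html.parser" backend (my assumptions, not something the question specifies); find_next_sibling(string=True) is the bs4 spelling of the call above, and stripped_strings is an alternative if you just want every piece of text in the paragraph.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><i>Italic stuff</i> Not Italic stuff</p>", "html.parser")

# Text inside the <i> tag.
ital = soup.i.string

# The text node directly after <i> is simply its next sibling.
notital = soup.i.next_sibling

# Same result via a search restricted to string siblings.
notital_found = soup.i.find_next_sibling(string=True)

# Every text fragment inside <p>, with surrounding whitespace stripped.
all_parts = list(soup.p.stripped_strings)

print(repr(ital), repr(notital.strip()), all_parts)
```

Note that the tail text keeps its leading space, so you will probably want to call .strip() on it.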

score -1 · Accepted Answer

A utility function to remove tags:

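# Assumes BeautifulSoup 3 on Python 2, as implied by unicode() and findAll()
from BeautifulSoup import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html)

    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            # Rebuild the tag's contents as a string, without the tag itself
            s = ""
            for c in tag.contents:
                # Child tags are stripped recursively; plain strings are kept as-is
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)

            # Replace the invalid tag with its stripped contents
            tag.replaceWith(s)

    return soup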

Use the function to remove the tags accordingly:

# Invalid tags which we want to remove from the content
invalid_tags = ['p', 'div', 'a', 'strong', 'img', 'span', 'br', 'h1', 'h2', 'h3', 'h5', 'h6', 'em']
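If you are on bs4 rather than BeautifulSoup 3, a shorter variant is possible. The sketch below is not the function from this answer: it swaps the recursive replaceWith approach for bs4's Tag.unwrap(), which removes a tag but keeps its children. The name strip_tags_bs4 is hypothetical, and the "html.parser" backend and Python 3 are assumptions on my part.

```python
from bs4 import BeautifulSoup

def strip_tags_bs4(html, invalid_tags):
    """Hypothetical bs4 variant: drop the listed tags but keep their contents."""
    soup = BeautifulSoup(html, "html.parser")
    # Passing a list to find_all matches any of the given tag names.
    for tag in soup.find_all(invalid_tags):
        # unwrap() removes the tag itself and leaves its children in place.
        tag.unwrap()
    return soup

# A shorter list than the one above, just for this example.
example_invalid_tags = ['p', 'i']
cleaned = strip_tags_bs4('<p><i>Italic stuff</i> Not Italic stuff</p>',
                         example_invalid_tags)
print(str(cleaned))
```

For the markup in the question this leaves only the text, with both the p and i tags gone.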
