python - How to extract the text inside a tag along with its inner tags?

Question

I am using BeautifulSoup and want to extract the text inside a tag without removing the inner HTML tags. Sample input:

<a class="fl" href="https://stackoverflow.com/questio...">
    Angular2 <b>Router link not working</b>
</a>

Sample output:

'Angular2 <b>Router link not working</b>'

I tried this:

from bs4 import BeautifulSoup
string = '''<a class="fl" href="https://stackoverflow.com/questio...">
         Angular2 <b>Router link not working</b>
         </a>'''
soup = BeautifulSoup(string, 'html.parser')
print(soup.text)

But it gives:

'Angular2 Router link not working'

How can I extract the text without removing the inner tags?

score 2 · Accepted Answer

The first answer from here works fine. For this example:

from bs4 import BeautifulSoup
string = '''<a class="fl" href="https://stackoverflow.com/questio...">
             Angular2 <b>Router link not working</b>
         </a>'''
soup = BeautifulSoup(string, 'html.parser')
# encode_contents() returns the inner markup as bytes; decode it and trim the surrounding whitespace
soup.find('a').encode_contents().decode('utf-8').strip()

It gives:

'Angular2 <b>Router link not working</b>'
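As a variation on the snippet above, BeautifulSoup also offers decode_contents(), which returns the inner markup directly as a str and so avoids the bytes round trip. This is only a minimal sketch using the sample HTML from the question; the variable names html and inner_html are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Sample HTML from the question (the href is truncated in the source).
html = '''<a class="fl" href="https://stackoverflow.com/questio...">
    Angular2 <b>Router link not working</b>
</a>'''

soup = BeautifulSoup(html, 'html.parser')

# decode_contents() renders the tag's children as a str, inner tags included,
# without the enclosing <a ...> wrapper itself.
inner_html = soup.find('a').decode_contents().strip()
print(inner_html)  # Angular2 <b>Router link not working</b>
```

The strip() call removes the newlines and indentation that surround the content inside the <a> tag.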

score 1 · Accepted Answer

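With print(soup.text) you are extracting all the text from the 'a' tag, including the text of every tag inside it. If you only want the 'b' tag object, try the following: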

from bs4 import BeautifulSoup

# `string` is the HTML string from the question
soup = BeautifulSoup(string, 'html.parser')
b = soup.find('b')
print(b)
print(type(b))

Or:

soup = BeautifulSoup(string, 'html.parser')
b = soup.find('a', class_="fl").find('b')
print(b)
print(type(b))

Output:

<b>Router link not working</b>
<class 'bs4.element.Tag'>

As you can see, it returns your 'b' tag as a BeautifulSoup Tag object.

If you need the data as a string, you can write:

b = soup.find('a', class_="fl").find('b')
b = str(b)
print(b)
print(type(b))

Output:

<b>Router link not working</b>
<class 'str'>
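One thing to watch with chained find() calls: find() returns None when nothing matches, so the second call in the chain would raise AttributeError. A possible alternative, sketched here under the assumption that CSS selectors are acceptable, is select_one(), which returns None for the whole selector if any part fails to match. The class name 'missing' is a hypothetical one used only to show the no-match case.

```python
from bs4 import BeautifulSoup

# Sample HTML from the question (the href is truncated in the source).
html = '''<a class="fl" href="https://stackoverflow.com/questio...">
    Angular2 <b>Router link not working</b>
</a>'''

soup = BeautifulSoup(html, 'html.parser')

# select_one() takes a CSS selector and returns the first match, or None.
b = soup.select_one('a.fl b')
b_html = str(b) if b is not None else ''

# 'missing' is a hypothetical class that does not occur in the sample,
# so this lookup yields None instead of raising an exception.
missing = soup.select_one('a.missing b')

print(b_html)   # <b>Router link not working</b>
print(missing)  # None
```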

score 0 · Accepted Answer

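As Den said, you need to get the inner tag and then store it as type str so that the tag itself is included. Den's solution specifically grabs the <b> tag; it does not capture the parent tag's text, nor any other style tags that might be present. If there are other tags, you can make it more general by finding the child elements of the <a> tag instead of looking specifically for the <b> tag.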

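Basically, this finds the <a> tag and grabs its entire text. It then goes through the children of that <a> tag, converts each one to a string, and replaces the matching text in the parent text with that string (tags included).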

string = '''<a class="fl" href="https://stackoverflow.com/questio...">
     Angular2 <b>Router link not working</b> and then this is in <i>italics</i> and this is in <b>bold</b>
     </a>'''



from bs4 import BeautifulSoup

soup = BeautifulSoup(string, 'html.parser')

for item in soup.find_all('a'):
    # Start from the plain text of the <a> tag, with inner tags stripped out.
    parent = item.text.strip()
    # Swap each direct child's text for its full markup (tags included).
    # recursive=False avoids wrapping nested tags twice.
    # Caveat: str.replace() replaces every occurrence of the child's text.
    for child_ele in item.find_all(recursive=False):
        parent = parent.replace(child_ele.text, str(child_ele))
    print(parent)

Output:

Angular2 <b>Router link not working</b> and then this is in <i>italics</i> and this is in <b>bold</b>
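The text-replacement approach can misfire when a child's text also appears elsewhere in the parent text. A possible alternative, shown here only as a sketch, is to join the string form of each direct child in the tag's .contents list, which keeps every node in its original position. The names html, a_tag and inner_html are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Sample HTML from the answer above (the href is truncated in the source).
html = '''<a class="fl" href="https://stackoverflow.com/questio...">
     Angular2 <b>Router link not working</b> and then this is in <i>italics</i> and this is in <b>bold</b>
     </a>'''

soup = BeautifulSoup(html, 'html.parser')

for a_tag in soup.find_all('a'):
    # .contents lists the direct children: text nodes and Tag objects.
    # str() on a text node gives its text; on a Tag it gives its full markup.
    inner_html = ''.join(str(child) for child in a_tag.contents).strip()
    print(inner_html)
```

For this sample it prints the same line as the output above. Note that text nodes are emitted as plain text, so characters such as & are not re-escaped; if valid HTML output matters, decode_contents() on the tag is likely the safer choice.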
