beautifulsoup - convert <br> to end line

Question

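I am trying to extract some text using BeautifulSoup. I use the get_text() function for this purpose.

My problem is that the text contains </br> tags, and I need to convert them to end lines. How can I do this?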

score 74 · Accepted Answer

You can do this using the BeautifulSoup object itself, or any element of it:

for br in soup.find_all("br"):
    br.replace_with("\n")

score 57 · Accepted Answer

As the official documentation says:

You can specify a string to be used to join the bits of text together: soup.get_text("\n")
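One thing to keep in mind is that the separator is placed between every text fragment, not only where a <br> was. A small sketch with made-up markup shows the effect:

```python
from bs4 import BeautifulSoup

# Made-up sample: one <br> plus an inline <b> tag.
html = "<div>first line<br>second <b>bold</b> line</div>"
soup = BeautifulSoup(html, "html.parser")

# The separator joins all text nodes, so the <b> boundaries get a newline too.
text = soup.get_text("\n")
print(repr(text))
```

If the markup has inline tags such as <b> or <a>, this may produce more line breaks than intended, so replacing the <br> elements explicitly can be the safer choice.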

score 5 · Accepted Answer

A regex should do the trick.

import re
# raw string avoids the invalid "\s" escape; "/?" also matches <br/> and <br />
s = re.sub(r'<br\s*/?>', '\n', yourTextHere)
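If the markup mixes several spellings of the tag, a slightly more tolerant pattern may help. This is only a sketch: BR_TAG and br_to_newline are hypothetical names, and the pattern does not handle tags with attributes such as <br class="x">.

```python
import re

# Hypothetical pattern: matches <br>, <br/>, <br /> and the malformed </br>,
# in any letter case. Tags with attributes are NOT matched.
BR_TAG = re.compile(r"<\s*/?\s*br\s*/?\s*>", re.IGNORECASE)


def br_to_newline(markup: str) -> str:
    """Return markup with every bare br tag replaced by a newline."""
    return BR_TAG.sub("\n", markup)


sample = "one<br>two<BR/>three<br />four</br>five"
result = br_to_newline(sample)
print(result)
```

The result can then be passed to BeautifulSoup as usual. For anything beyond simple markup, working on the parsed tree is generally more robust than a regex.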

Hope this helps!

score 2 · Accepted Answer

Adding to Ian's and dividebyzero's posts/comments, you can do this to filter/replace many tags at once:

for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.replace_with(elem.text + "\n\n")
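A quick sketch of that loop on made-up, non-nested markup (the tag list is shortened for the example):

```python
from bs4 import BeautifulSoup

# Made-up sample with two sibling block-level tags.
html = "<h3>Title</h3><p>Body text</p>"
soup = BeautifulSoup(html, "html.parser")

# Replace each listed tag with its own text followed by a blank line.
for elem in soup.find_all(["h3", "p"]):
    elem.replace_with(elem.text + "\n\n")

text = soup.get_text()
print(repr(text))
```

Note that when listed tags are nested (a <p> inside a <div>, say), the outer tag is replaced first with its flattened text, so it appears that the inner tags no longer contribute their own line breaks.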

score 1 · Accepted Answer

Instead of replacing the tags with \n, it may be better to append a \n to the end of every tag that matters.

Borrowing the list from @petezurich:

for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.append('\n')
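Here is a small sketch of the append approach on made-up markup. Since the tags stay in the tree, nested structure is preserved and only the extracted text changes:

```python
from bs4 import BeautifulSoup

# Made-up sample: two paragraphs, the second containing a <br>.
html = "<div><p>first paragraph</p><p>second<br>paragraph</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Append a newline string as the last child of each listed tag.
for elem in soup.find_all(["p", "br"]):
    elem.append("\n")

text = soup.get_text()
print(repr(text))
```

Including "div" in the list, as in the original, would add one more newline at the end of the enclosing block.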

score 0 · Accepted Answer

If you call element.text, you get the text without the br tags. You may need to define your own custom method for this purpose:

from bs4 import BeautifulSoup

def clean_text(elem):
    # walk every descendant: keep text fragments, emit a newline for <br> and <p>
    text = ''
    for e in elem.descendants:
        if isinstance(e, str):
            text += e.strip()
        elif e.name == 'br' or e.name == 'p':
            text += '\n'
    return text

# get page content (request_response is the HTTP response fetched beforehand)
soup = BeautifulSoup(request_response.text, 'html.parser')
# get your target element
description_div = soup.select_one('.description-class')
# clean the data
print(clean_text(description_div))
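Since request_response is not defined in the snippet, here is a self-contained sketch that runs the same helper on made-up inline markup. The function is repeated so the block runs on its own:

```python
from bs4 import BeautifulSoup


def clean_text(elem):
    """Concatenate stripped text fragments, adding a newline at each <br> or <p>."""
    text = ''
    for e in elem.descendants:
        if isinstance(e, str):
            text += e.strip()
        elif e.name == 'br' or e.name == 'p':
            text += '\n'
    return text


# Made-up markup standing in for the downloaded page.
html = '<div class="description-class">line one<br>line two<p>new paragraph</p></div>'
soup = BeautifulSoup(html, 'html.parser')
description_div = soup.select_one('.description-class')
cleaned = clean_text(description_div)
print(cleaned)
```

Because each fragment is passed through strip(), whitespace next to inline tags is dropped, so text like "Hello <b>world</b>" would come out as "Helloworld". Adjust the stripping if that matters for your data.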

score 0 · Accepted Answer

You can also use get_text(separator='\n', strip=True):

from bs4 import BeautifulSoup
bs = BeautifulSoup('<td>some text<br>some more text</td>', 'html.parser')
text = bs.get_text(separator='\n', strip=True)
print(text)

Output:
some text
some more text

This worked for me.
