python - Python: get the text value of an HTML document

Question

My question is simple: I have a string that contains HTML tags, and I want to get only the actual text value from it. For example:

HTML string:

<strong><p> hello </p><p> world </p></strong>

Text value: hello world

Is there a function that can do this?

score 3 · Accepted Answer

You can use BeautifulSoup's get_text() method:

from bs4 import BeautifulSoup


text = "<strong><p> hello </p><p> world </p></strong>"

# Name the parser explicitly; without it bs4 picks one itself and emits a warning.
soup = BeautifulSoup(text, "html.parser")
print(soup.get_text())  # prints " hello  world "
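
Note that the output above keeps the whitespace from the source markup. To get exactly "hello world", one option is get_text()'s separator and strip arguments. The following is a minimal sketch, assuming bs4 is installed; the helper name extract_text is made up for illustration:

```python
from bs4 import BeautifulSoup


def extract_text(html):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # strip=True trims each text node and drops empty ones;
    # separator=" " joins the remaining pieces with a single space.
    return soup.get_text(separator=" ", strip=True)


print(extract_text("<strong><p> hello </p><p> world </p></strong>"))  # prints "hello world"
```

Since a separator is placed between every pair of text nodes, text split by inline tags (for example he<b>llo</b>) will also gain a space.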

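Alternatively, older releases of nltk (2.x) provide clean_html(). In NLTK 3 this function is no longer implemented: calling it raises NotImplementedError and points you to BeautifulSoup's get_text() instead.

import nltk


text = "<strong><p> hello </p><p> world </p></strong>"
# NLTK 2.x only; on NLTK 3 this call raises NotImplementedError.
print(nltk.clean_html(text))  # prints "hello world"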

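Another option is html2text, but it behaves a little differently: it converts HTML to Markdown rather than plain text, so, for example, strong is replaced with * markers.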
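
If you would rather not install a third-party package, the standard library's html.parser module may be enough for simple fragments. This is only a rough sketch, not part of the original answer; the names TextExtractor and html_to_text are invented for illustration, and it does not skip the contents of script or style tags:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text nodes with spaces, then collapse all whitespace runs.
    return " ".join(" ".join(parser.chunks).split())


print(html_to_text("<strong><p> hello </p><p> world </p></strong>"))  # prints "hello world"
```

Because the text nodes are joined with spaces, words split by inline tags are separated as well; for real-world pages BeautifulSoup is usually the more robust choice.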

See also the related thread: Extracting text from HTML file using Python

Hope that helps.
