python - How can I replace or remove HTML entities like "&nbsp;" using BeautifulSoup 4

Question

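I am processing HTML using Python and the BeautifulSoup 4 library, and I can't find an obvious way to replace &nbsp; with a space. Instead, it seems to be converted to a Unicode non-breaking space character.

Am I missing something obvious? What is the best way to replace &nbsp; with a normal space using BeautifulSoup?

Edit to add: I am using the latest version, BeautifulSoup 4, so the convertEntities=BeautifulSoup.HTML_ENTITIES option from Beautiful Soup 3 isn't available.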

score 29 · Accepted Answer

>>> soup = BeautifulSoup('<div>a&nbsp;b</div>')
>>> soup.prettify(formatter=lambda s: s.replace(u'\xa0', ' '))
u'<html>\n <body>\n  <div>\n   a b\n  </div>\n </body>\n</html>'
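The session above passes a callable as the formatter. A minimal Python 3 sketch of the same idea follows, assuming the built-in "html.parser" backend (so no <html>/<body> wrapper is added). The helper name nbsp_to_space is hypothetical. Because a callable formatter replaces Beautiful Soup's default escaping, this sketch calls EntitySubstitution.substitute_xml first so that characters such as & are still escaped in the output.

```python
from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution


def nbsp_to_space(text):
    """Hypothetical formatter: keep minimal escaping, then swap U+00A0 for a space."""
    escaped = EntitySubstitution.substitute_xml(text)
    return escaped.replace("\xa0", " ")


soup = BeautifulSoup("<div>a&nbsp;b &amp; c</div>", "html.parser")

# prettify() calls the formatter on every string and attribute value.
pretty = soup.prettify(formatter=nbsp_to_space)
print(pretty)
```

Note that the formatter only affects the serialized output; the parsed tree still contains the non-breaking space.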

score 17 · Accepted Answer

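See Entities in the documentation. BeautifulSoup 4 produces proper Unicode for all entities:

An incoming HTML or XML entity is always converted into the corresponding Unicode character.

Yes, &nbsp; is turned into a non-breaking space character. If you really want those to be space characters instead, you will have to do a Unicode replacement.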
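One way to do that replacement on the parsed tree itself, rather than on the output, is to rewrite each text node. This is only a sketch, assuming the "html.parser" backend and a made-up sample document; it relies on find_all(string=True) and replace_with from Beautiful Soup 4.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>a&nbsp;b <b>c&nbsp;d</b></p>", "html.parser")

# The entity is already decoded: the tree holds U+00A0, not "&nbsp;".
print(repr(soup.get_text()))

# Rewrite every text node that contains a non-breaking space.
for node in soup.find_all(string=True):
    if "\xa0" in node:
        node.replace_with(node.replace("\xa0", " "))

result = str(soup)
print(result)  # <p>a b <b>c d</b></p>
```

After this loop, both get_text() and the serialized HTML contain ordinary spaces.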

score 13 · Accepted Answer

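You can simply replace the non-breaking space Unicode character with a normal space.

non_break_space = u'\xa0'
# `text` must be a plain string, e.g. str(soup) or soup.get_text()
text = text.replace(non_break_space, ' ')

A benefit is that, even though you are using BeautifulSoup, you don't need it for this step.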
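As a sketch of how this could look with no parser at all, the function below works on a raw HTML string. The name nbsp_to_space and the sample input are made up for illustration, and handling the literal "&nbsp;" entity in addition to the decoded character is an assumption about what the input may contain.

```python
def nbsp_to_space(raw_html: str) -> str:
    """Replace the &nbsp; entity and the decoded U+00A0 character with a space."""
    return raw_html.replace("&nbsp;", " ").replace("\xa0", " ")


cleaned = nbsp_to_space("<div>a&nbsp;b\xa0c</div>")
print(cleaned)  # <div>a b c</div>
```

Numeric references such as &#160; are not covered by this sketch and would need their own replacement.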

score 3 · Accepted Answer

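I had a problem with JSON output that soup.prettify() didn't fix, so I used unicodedata.normalize() instead:

import json
import unicodedata
from bs4 import BeautifulSoup

# r is an HTTP response fetched earlier (not shown)
soup = BeautifulSoup(r.text, 'html.parser')
dat = soup.find('span', attrs={'class': 'date'})
print(f"date prints fine:'{dat.text}'")
print(f"json:{json.dumps(dat.text)}")
# NFKD maps the non-breaking space (U+00A0) to a regular space
mydate = unicodedata.normalize("NFKD", dat.text)
print(f"json after normalizing:'{json.dumps(mydate)}'")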

date prints fine:'03 Nov 19 17:51'
json:"03\u00a0Nov\u00a019\u00a017:51"
json after normalizing:'"03 Nov 19 17:51"'
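The same behaviour can be reproduced without a network request. The sketch below uses only the standard library and hard-codes a string equivalent to the scraped date; the variable names are illustrative.

```python
import json
import unicodedata

# Same text as the scraped date, with non-breaking spaces (U+00A0).
raw = "03\xa0Nov\xa019\xa017:51"

# json.dumps escapes non-ASCII characters by default, exposing the U+00A0.
before = json.dumps(raw)
print(before)  # "03\u00a0Nov\u00a019\u00a017:51"

# NFKD applies compatibility decomposition, which maps U+00A0 to U+0020.
normalized = unicodedata.normalize("NFKD", raw)
after = json.dumps(normalized)
print(after)  # "03 Nov 19 17:51"
```

The printed text looks identical before and after, which is why the problem only shows up once the string is serialized.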

score 2 · Accepted Answer

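Admittedly, this isn't using BeautifulSoup, but a more straightforward solution today might be some combination of html.unescape and unicodedata.normalize, depending on your data and what you want to do.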

>>> from html import unescape
>>> s = unescape('An enthusiastic member of the&nbsp;community')
>>> s
'An enthusiastic member of the\xa0community'
>>> import unicodedata
>>> s = unicodedata.normalize('NFKC', s)
>>> s
'An enthusiastic member of the community'
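The "depending on your data" caveat matters because NFKC and NFKD rewrite every compatibility character, not only the non-breaking space. The sketch below, using a made-up input string, compares normalization with a targeted replace.

```python
import unicodedata
from html import unescape

# Decodes to 'x²', a non-breaking space, then '½'.
s = unescape("x&sup2;&nbsp;&frac12;")

# NFKC folds the superscript and the fraction as well as the space.
nfkc = unicodedata.normalize("NFKC", s)
print(ascii(nfkc))  # 'x2 1\u20442'

# A targeted replace changes only the non-breaking space.
targeted = s.replace("\xa0", " ")
print(ascii(targeted))  # 'x\xb2 \xbd'
```

If the other characters must survive unchanged, the targeted replace is likely the safer choice.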
