python - Escaping &hellip; with BeautifulSoup

Question

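I am currently using BeautifulSoup to scrape some websites, but I have a problem with certain characters. The code in UnicodeDammit seems to indicate that this is (once again) some Microsoft invention.

I am using the latest version of BeautifulSoup available to me (3.0.8.1), because I am still on Python 2.5.

The following code illustrates my problem: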

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('...Baby One More Time (Digital Deluxe Version&hellip;')
print soup

'...Baby One More Time (Digital Deluxe Version&hellip;'

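As you can see, the problem is the '...' (&hellip;) character at the end, which comes back as the raw entity (your browser may have rendered it correctly). Obviously this is not what I want.

It would be nice to get the Unicode representation of this character or something similar. Even simply ignoring it would solve my particular problem.

How can I do this with BeautifulSoup?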

score 1 · Accepted Answer

Found the solution myself:

soup = BeautifulSoup('...Baby One More Time (Digital Deluxe Version&hellip;', convertEntities="html")
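
For readers without BeautifulSoup 3 at hand, the effect of entity conversion can be approximated with the standard library alone. The sketch below is only an illustration of the idea, not BeautifulSoup's actual implementation; the helper name replace_named_entities and the pattern ENTITY_RE are made up for this example. It assumes that only named entities such as &hellip; need handling.

```python
import re

try:
    from htmlentitydefs import name2codepoint  # Python 2
except ImportError:
    from html.entities import name2codepoint  # Python 3

try:
    unichr  # Python 2 built-in
except NameError:
    unichr = chr  # Python 3: chr already returns a Unicode character

# Hypothetical pattern for this sketch: matches named entities like &hellip;
ENTITY_RE = re.compile(r'&(\w+);')


def replace_named_entities(text):
    """Replace known named HTML entities with their Unicode characters.

    Unknown entity names are left untouched.
    """
    def substitute(match):
        name = match.group(1)
        if name in name2codepoint:
            return unichr(name2codepoint[name])
        return match.group(0)

    return ENTITY_RE.sub(substitute, text)


converted = replace_named_entities(
    u'...Baby One More Time (Digital Deluxe Version&hellip;')
print(repr(converted))
```

If dropping the entity is enough, as the question suggests, ENTITY_RE.sub(u'', text) would remove every named entity instead of converting it.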

score 1 · Accepted Answer

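MS may have invented it, but &hellip; is part of HTML 4: http://www.w3.org/TR/REC-html40/sgml/entities.html

Perhaps your Lib/htmlentitydefs.py is missing or out of date, since that is what BeautifulSoup uses to convert entities.

If you look at the Python 2.5 source tree, you can clearly see it defined on line 126.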
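
A quick way to check whether the installed entity table knows about hellip is to look it up directly. This is a minimal sketch; the try/except import is only there so the same snippet also runs on Python 3, where the module was renamed to html.entities.

```python
try:
    from htmlentitydefs import name2codepoint  # Python 2
except ImportError:
    from html.entities import name2codepoint  # Python 3

# None here would suggest the entity table is missing the definition.
codepoint = name2codepoint.get('hellip')
print(codepoint)
```

A healthy installation should report the code point of U+2026 (HORIZONTAL ELLIPSIS).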
