python - BeautifulSoup does not recognize UTF-8 characters, even with "fromEncoding=UTF-8"

Question

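I wrote a simple script that takes a web page and extracts its contents into a tokenized list. However, I have run into a problem: when I convert the BeautifulSoup object to a string, UTF-8 characters such as “, ” and so on are not converted. Instead, they remain in unicode format.

When I create the BeautifulSoup object I define the source as UTF-8, and I have even tried running the unicode conversion separately, but nothing works. Does anyone know why this happens?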

from urllib2 import urlopen
from bs4 import BeautifulSoup
import nltk, re, pprint

url = "http://www.bloomberg.com/news/print/2013-07-05/softbank-s-21-6-billion-bid-for-sprint-approved-by-u-s-.html"
raw = urlopen(url).read()
soup = BeautifulSoup(raw, fromEncoding="UTF-8")
result = soup.find_all(id="story_content")
str_result = str(result)
notag = re.sub("<.*?>", " ", str_result)
output = nltk.word_tokenize(notag)
print(output)

score 3 · Accepted Answer


Answered 2013-07-06T15:06:32.993
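
One plausible cause, assuming the script runs on Python 2 (as the `urllib2` import suggests): calling `str()` on a list, or printing one, shows the repr of each item, and in that repr non-ASCII characters appear as escape sequences rather than as the characters themselves. The sketch below is not taken from the question. It avoids stringifying the list by pulling the text out of each matched element with `get_text()` and joining the pieces. The inline `raw` page is a hypothetical stand-in for the downloaded HTML, `extract_text` is an illustrative helper name, and the code is written for Python 3 with the built-in `html.parser`. Note that bs4 spells the keyword `from_encoding`; `fromEncoding` is the BeautifulSoup 3 spelling.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, encoded as UTF-8 bytes.
raw = (
    '<html><body><div id="story_content">'
    '<p>SoftBank\u2019s bid, \u201capproved\u201d by the U.S.</p>'
    '</div></body></html>'
).encode("utf-8")


def extract_text(raw_bytes):
    """Return the tag-free text of every element with id="story_content"."""
    # from_encoding tells bs4 how to decode the incoming bytes.
    soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="utf-8")
    elements = soup.find_all(id="story_content")
    # get_text() returns decoded text without markup, so neither str() on
    # the result list nor a regex to strip tags is needed.
    return " ".join(el.get_text(" ", strip=True) for el in elements)


text = extract_text(raw)
print(text)
```

The resulting string can be passed to `nltk.word_tokenize` as in the question. To inspect the tokens without escape sequences on Python 2, print them individually or joined into one string rather than printing the list itself.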
