python - Beautiful Soup and character encoding

Question

I am trying to extract text and HTML from a website containing Scandinavian characters, using Beautiful Soup and Python 2.6.5.

html = open('page.html', 'r').read()
soup = BeautifulSoup(html)

descriptions = soup.findAll(attrs={'class' : 'description' })

for i in descriptions:
    description_html = i.a.__str__()
    description_text = i.a.text.__str__()
    description_html = description_html.replace("/subdir/", "http://www.domain.com/subdir/")
    print description_html

However, when executed, the program fails with the following error message:

Traceback (most recent call last):
    File "test01.py", line 40, in <module>
        description_text = i.a.text.__str__()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 19: ordinal not in range(128)

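The input page appears to be encoded in ISO-8859-1, if that helps. I tried setting the correct source encoding with BeautifulSoup(html, fromEncoding="latin-1"), but that did not help either.

It is 2011 and I am struggling with trivial character encoding problems. I am sure there is a very simple solution to all this.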

score 5 · Accepted Answer

i.a.__str__('latin-1')

or

i.a.text.encode('latin-1')

should work.

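Are you sure it is latin-1? Beautiful Soup should detect the encoding correctly.

Also, if it turns out that you do not need to specify the encoding, why not just use str(i.a)?

Edit: It looks like you need to install chardet for the encoding to be detected automatically.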
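To illustrate the explicit encoding suggested above, here is a minimal sketch that does not depend on Beautiful Soup. The sample string and the variable names are made up for illustration, and the u'' literals run on Python 2 as well as Python 3.3 or later.

```python
# Hypothetical sample text standing in for i.a.text; it contains U+00E4 three times.
link_text = u'J\xe4rvenp\xe4\xe4'

# Encoding explicitly chooses the byte representation instead of relying on ASCII.
latin1_bytes = link_text.encode('latin-1')   # one byte per character for this text
utf8_bytes = link_text.encode('utf-8')       # two bytes for each non-ASCII character

# Decoding with the same codec recovers the original Unicode text.
round_trip = latin1_bytes.decode('latin-1')
```

Which codec to pick depends on what the consumer of the bytes expects; latin-1 matches the page in the question, while utf-8 can represent any character.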

score 0 · Accepted Answer

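I ran into the same problem: Beautiful Soup failed to output a line of text containing German characters. Unfortunately, none of the countless answers, even those on Stack Overflow, solved my problem: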

        title = str(link.contents[0].string)

This gives: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 32: ordinal not in range(128)

Many of the answers did, however, give valuable pointers toward the correct solution. As Lennart Regebro says in "UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 3 2: ordinal not in range(128)":

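When you do str(u'\u2013') you are trying to convert a Unicode string to an 8-bit string. To do this you need to use an encoding, a mapping from Unicode data to 8-bit data. What str() does is use the system default encoding, which under Python 2 is ASCII. ASCII contains only the first 128 code points of Unicode, that is, \u0000 to \u007F. The result is that you get the error above: the ASCII codec simply does not know what \u2013 is (it is a long dash, by the way).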
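The quoted explanation can be reproduced without Beautiful Soup. The sketch below calls .encode('ascii') directly, which is the conversion str() attempts on a Unicode string under Python 2 with the default ASCII encoding. The sample text and variable names are hypothetical.

```python
text = u'pages 3\u201332'   # hypothetical text with U+2013 between the digits

try:
    text.encode('ascii')    # the ASCII codec covers only U+0000 to U+007F
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    # The exception records which slice of the input could not be encoded.
    bad_char = exc.object[exc.start:exc.end]

# An encoding that covers the character succeeds.
encoded = text.encode('utf-8')
```

Here the failure is triggered on purpose; in the failing programs above it happens implicitly inside str() or __str__().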

For me, it was simply a matter of not using str() to convert the Beautiful Soup object to a string. Fiddling with the console's default output made no difference either.

            ### title = str(link.contents[0].string)
            ### should be
            title = link.contents[0].encode('utf-8')
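Another way to avoid the implicit conversion is to keep text as Unicode inside the program and encode only when writing output. This is a minimal sketch, assuming UTF-8 output is acceptable; the file name and the sample title are hypothetical, and io.open is available from Python 2.6 onward.

```python
import io
import os
import tempfile

title = u'Sch\xf6ne Gr\xfc\xdfe'   # hypothetical title containing German characters

# io.open encodes on write, so neither str() nor a manual .encode() call is needed.
out_path = os.path.join(tempfile.mkdtemp(), 'titles.txt')
with io.open(out_path, 'w', encoding='utf-8') as handle:
    handle.write(title + u'\n')

# Reading back with the same encoding yields the original Unicode text.
with io.open(out_path, 'r', encoding='utf-8') as handle:
    restored = handle.read().rstrip(u'\n')
```

The design choice is to treat encoding as an output concern: the program works with Unicode throughout and only the file boundary deals with bytes.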
