python - UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026'

Question

I am learning urllib2 and Beautiful Soup, and on my first tests I am getting errors like:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128)

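There seem to be many posts about this type of error, and I have tried the solutions I can understand, but there always seems to be a catch-22, e.g.:

I want to print post.text (where text is a Beautiful Soup method that returns only the text). Both str(post.text) and post.text produce unicode errors (on things like the right apostrophe ' and ...).

So I added post = unicode(post) above str(post.text), and then I got: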

AttributeError: 'unicode' object has no attribute 'text'

I also tried (post.text).encode() and (post.text).renderContents(). The latter produces the error:

AttributeError: 'unicode' object has no attribute 'renderContents'

Then I tried str(post.text).renderContents() and got the error:

AttributeError: 'str' object has no attribute 'renderContents'

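It would be great if I could declare somewhere at the top of the document "make this content interpretable" and still have access to the text function I need.

Update: after suggestions:

If I add post = post.decode("utf-8") above str(post.text), I get: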

TypeError: unsupported operand type(s) for -: 'str' and 'int'

If I add post = post.decode() above str(post.text), I get:

AttributeError: 'unicode' object has no attribute 'text'

If I add post = post.encode("utf-8") above (post.text), I get:

AttributeError: 'str' object has no attribute 'text'

I tried print post.text.encode('utf-8') and got:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)

For the sake of trying things that might help, I installed lxml for Windows from here and used it like this:

parsed_content = BeautifulSoup(original_content, "lxml")

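according to http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters.

These steps don't seem to have made a difference.

I am using Python 2.7.4 and Beautiful Soup 4.

Solution:

After getting a deeper understanding of unicode, utf-8 and the Beautiful Soup types, it turned out the problem was in how I was printing. I removed all of my str calls and concatenations, e.g. str(something) + post.text + str(something_else), so that the statement became something, post.text, something_else, and it seems to print fine, except that I have less control over the formatting at this stage (e.g. a space is inserted at each ,).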
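One possible way to get the formatting control back without reintroducing str() is to build the whole line from a unicode template and encode it once at the end. This is only a sketch: the helper name format_post and the sample values are made up for illustration and are not from my script.

```python
# Hypothetical helper: joins the pieces with a unicode template, so no
# implicit ASCII conversion happens while the line is being assembled.
def format_post(prefix, post_text, suffix):
    return u"{0} {1} {2}".format(prefix, post_text, suffix)


# Sample values standing in for something, post.text and something_else.
line = format_post(u"Title:", u"Read more\u2026", u"(end)")

# Encode exactly once, at the point where bytes are actually needed.
payload = line.encode("utf-8")
```

The template decides where the spaces go, and the text stays unicode until the final encode.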

score 46 · Accepted Answer

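In Python 2, unicode objects can only be printed if they can be converted to ASCII. If one can't be encoded in ASCII, you get that error. You probably want to encode it explicitly and then print the resulting str: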

print post.text.encode('utf-8')
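To see why the explicit encode helps, here is a minimal sketch that performs the failing ASCII conversion by hand and then the UTF-8 one. The sample string is made up; in Python 2, calling str() on a unicode object attempts the same ASCII conversion implicitly.

```python
# Made-up sample text ending in U+2026 (horizontal ellipsis).
text = u"Read more\u2026"

# Encoding to ASCII fails because U+2026 is outside the range 0-127.
try:
    text.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# Encoding to UTF-8 succeeds: the ellipsis becomes three bytes.
encoded = text.encode("utf-8")
```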

score 2 · Accepted Answer

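    import urllib.request
    from bs4 import BeautifulSoup
    html = urllib.request.urlopen(THE_URL).read()  # THE_URL: placeholder for the page URL
    soup = BeautifulSoup(html, "html.parser")
    # Encode to ASCII bytes, then decode back to text so print shows no b'...' wrapper
    print("'" + soup.encode("ascii").decode("ascii") + "'")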

Works for me ;-)

score 0 · Accepted Answer

Have you tried .decode() or .decode("utf-8")?

Also, I suggest using lxml with the html5lib parser:

http://lxml.de/html5parser.html
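As a rough illustration of what .decode("utf-8") does, here is a sketch on raw bytes. The byte string is a made-up stand-in for what a page download might return. Note that decode applies to the raw bytes, not to a parsed Beautiful Soup object.

```python
# Made-up UTF-8 bytes standing in for downloaded page content.
raw = b"Read more\xe2\x80\xa6"

# Decoding with the right codec turns the bytes into text.
text = raw.decode("utf-8")

# Decoding the same bytes as ASCII fails on the first non-ASCII byte.
try:
    raw.decode("ascii")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = str(exc)
```

The ASCII attempt fails on byte 0xe2, the same byte named in the UnicodeDecodeError quoted in the question.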
