python - Why doesn't Python display this text correctly? (UTF-8 decoding issue)

Question

import urllib.request as u

zipcode = str(47401)
url = 'http://watchdog.net/us/?zip=' + zipcode
con = u.urlopen(url)

page = str(con.read())
value3 = int(page.find("<title>")) + 7
value4 = int(page.find("</title>")) - 15
district = str(page[value3:value4])
print(district)
newdistrict = district.replace("\xe2\x80\x99","'")
print(newdistrict)

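For some reason, my code extracts the title in the following format: IN-09: Indiana\xe2\x80\x99s 9th. I know the \xe... string is the unicode for the ' symbol, but I can't figure out how to get Python to replace that group of characters with a ' symbol. I've tried decoding the string, but it is already unicode, and the replace code above doesn't change anything. Any suggestions as to what I'm doing wrong?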

score 6 · Accepted Answer

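When you call con.read(), it returns a bytes object. Calling str() on it without specifying an encoding returns a string representation of that object, so you get escape sequences rather than the real characters. (This means your string ends up containing a literal \\xe2\\x80\\x99 and various other undesirables.) bytes is much like str was in Python 2: it stores no encoding information. str in Python 3 is like unicode in Python 2: it holds decoded text. So, when turning a bytes object into a str object, you need to tell Python what encoding the bytes are actually in. In this case, that is utf-8.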
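To make the difference concrete, here is a minimal sketch that needs no network access. The raw value is a stand-in for what con.read() returns; it is built by hand here, which is an assumption for illustration only.

```python
# Stand-in for con.read(): the title text encoded as UTF-8 bytes (assumed payload).
raw = "IN-09: Indiana\u2019s 9th".encode("utf-8")

# str() with no encoding gives the repr of the bytes object, escapes included.
as_repr = str(raw)

# decode() turns the bytes into real text.
as_text = raw.decode("utf-8")

print(as_repr)                       # b'IN-09: Indiana\xe2\x80\x99s 9th'
print("\u2019" in as_text)           # True: the real apostrophe character is present
print("\xe2\x80\x99" in as_repr)     # False: why the question's replace() did nothing
```

The last line shows why the replace call in the question has no effect: the pattern "\xe2\x80\x99" is three characters (U+00E2, U+0080, U+0099), while the string returned by str() contains literal backslashes followed by letters and digits.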

Rather than calling str() on it, use bytes.decode. When str() is given an encoding argument the two are equivalent; decode is just neater.

>>> import urllib.request as u
>>> zipcode = 47401
>>> url = 'http://watchdog.net/us/?zip={}'.format(zipcode)
>>> con = u.urlopen(url)
>>> page = con.read().decode('utf-8')
>>> page[page.find("<title>") + 7:page.find("</title>") - 15]
'IN-09: Indiana’s 9th'

The only functional change made here is decoding the bytes object as 'utf-8'.
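If the fixed offsets (+ 7 and - 15) feel fragile, a regular expression can pull out the title after decoding. This is only a sketch: extract_title is a hypothetical helper, the sample payload is made up to stand in for con.read(), and falling back to UTF-8 when no charset is known is an assumption.

```python
import re

def extract_title(raw, charset=None):
    """Decode an HTML payload and return the text inside <title>, or None."""
    # Assumption: fall back to UTF-8 when no charset is supplied.
    text = raw.decode(charset or "utf-8", errors="replace")
    match = re.search(r"<title>(.*?)</title>", text, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

# Hypothetical payload standing in for con.read().
sample = ("<html><head><title>IN-09: Indiana\u2019s 9th (watchdog.net)"
          "</title></head></html>").encode("utf-8")

title = extract_title(sample)
missing = extract_title(b"<p>no title here</p>")
```

With urllib, the charset declared by the server (if any) is available as con.headers.get_content_charset(), which returns None when the Content-Type header does not name one; that value can be passed as the charset argument.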

score -1 · Accepted Answer

Try this

newdistrict = district.encode("**THE_INPUT_STRING_ENCODING**").replace("\\xe2\\x80\\x99","'")

I think you are using utf-8, so it should look like this

newdistrict = district.encode("utf-8").replace("\\xe2\\x80\\x99","'")

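But this is not the right way to work with unicode. Once your text has been read into the program, you should use unicode everywhere; only when writing output should you encode for the external destination.

So the better way is to add this line at the top of the script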

# -*- coding: utf-8 -*-

read your input as utf-8

page = con.read().decode('utf-8')

and then do newdistrict = district.replace(u"YOUR_UNICODE_STRING", "'")

For example

newdistrict = district.replace(u"דכעדחלגעדיל","'")
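Applied to the title from the question, the character to replace is the right single quotation mark, U+2019. A minimal sketch, assuming district already holds the decoded text (the value is written out by hand here rather than fetched):

```python
# Assumed value: the title after decoding the response as UTF-8.
district = "IN-09: Indiana\u2019s 9th"

# Replace the typographic apostrophe (U+2019) with a plain ASCII one.
# In Python 3 every str literal is unicode, so the u prefix is optional.
newdistrict = district.replace("\u2019", "'")

print(newdistrict)  # IN-09: Indiana's 9th
```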

For more help, read this

http://docs.python.org/howto/unicode.html
