python - Garbled text from urlopen

Question

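I am trying to read some utf-8 files from the addresses in the code below. It works for most files, but some of them cannot be read with urllib2 (or urllib).

The obvious answer is that the second file is corrupted, but the strange thing is that IE reads it with no problem at all. The code was tested on XP and on Linux, with identical results. Any suggestions?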

import urllib2

# This works:
f = urllib2.urlopen("http://www.gutenberg.org/cache/epub/145/pg145.txt")
line = f.readline()
print "this works: %s" % line
line = unicode(line, 'utf-8')  # ... works fine

# This doesn't:
f = urllib2.urlopen("http://www.gutenberg.org/cache/epub/144/pg144.txt")
line = f.readline()
print "this doesn't: %s" % line
line = unicode(line, 'utf-8')  # ... causes an exception

score 2 · Accepted Answer

>>> f=urllib2.urlopen("http://www.gutenberg.org/cache/epub/144/pg144.txt")
>>> f.headers.dict
{'content-length': '304513', ..., 'content-location': 'pg144.txt.utf8.gzip', 'content-encoding': 'gzip', ..., 'content-type': 'text/plain; charset=utf-8'}

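The headers show content-encoding: gzip, so the body is gzip-compressed bytes rather than plain UTF-8 text. Either set a header that stops the site from sending a gzip-encoded response, or decompress the response before decoding it.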
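Both options can be sketched roughly as follows. This is a minimal sketch written for Python 3 (where urllib2 became urllib.request), which is an assumption, and the helper names build_request, decode_body and fetch_text are hypothetical. build_request asks for an uncompressed body with Accept-Encoding: identity, and decode_body gunzips the payload when Content-Encoding says gzip. A server is free to ignore the request header, so the decode step is probably the more reliable of the two.

```python
import gzip
import urllib.request


def build_request(url):
    """Build a request that asks the server not to compress the body."""
    # "identity" means "no content coding"; a server may still ignore it.
    return urllib.request.Request(url, headers={"Accept-Encoding": "identity"})


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Return text from raw response bytes, gunzipping first if needed."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


def fetch_text(url):
    """Fetch a URL and return its decoded text (performs network I/O)."""
    with urllib.request.urlopen(build_request(url)) as resp:
        # Read the whole body; a partial read is not a complete gzip stream.
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

With these helpers, fetch_text("http://www.gutenberg.org/cache/epub/144/pg144.txt") would be the call for the failing URL from the question; it has not been run against the live site here.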

score 0 · Accepted Answer

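The URL you are requesting appears to point to a private cache. Try http://www.gutenberg.org/files/144/144-0.txt (found via http://www.gutenberg.org/ebooks/144).

If you really want to use the /cache/ URL: the server is sending you gzip-compressed data, not unicode. urllib2 does not ask for gzipped data and does not decode it, which is correct behavior. See this question for how to decompress it.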
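Because the server here sends gzip without being asked, one defensive option is to inspect the body itself instead of trusting the headers. The sketch below assumes Python 3 and uses hypothetical helper names (maybe_gunzip, to_text); it checks for the two-byte gzip magic number 1f 8b and decompresses only when it is present.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(raw):
    """Decompress raw bytes if they look like gzip, else return them as-is."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


def to_text(raw, charset="utf-8"):
    """Decode response bytes to str, transparently handling a gzip body."""
    return maybe_gunzip(raw).decode(charset)
```

Pass in the full body from read(), not a single readline() as in the question: one line of compressed bytes is not a complete gzip stream and will not decompress.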

score -1 · Accepted Answer

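This isn't a solution as such, but you should take a look at the Requests library at http://pypi.python.org/pypi/requests. Whether or not you still want to use urllib, you can read the Requests source code to see how it works with utf-8 strings.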
