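Previously, in Python 2.6, I made heavy use of urllib.urlopen to capture web page content and then post-process the data I received. Now those routines, and the new ones I am trying to write for Python 3.2, are running into what appears to be a Windows-only problem (possibly even a Windows 7-only problem).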

Using the following code on Windows 7 with Python 3.2.2 (64-bit) ...

import urllib.request

fp = urllib.request.urlopen(URL_string_that_I_use)

string = fp.read()
fp.close()
print(string.decode("utf8"))

I get the following message:

Traceback (most recent call last):
  File "TATest.py", line 5, in <module>
    string = fp.read()
  File "d:\python32\lib\http\client.py", line 489, in read
    return self._read_chunked(amt)
  File "d:\python32\lib\http\client.py", line 553, in _read_chunked
    self._safe_read(2)      # toss the CRLF at the end of the chunk
  File "d:\python32\lib\http\client.py", line 592, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(0 bytes read, 2 more expected)

Using the following code instead ...

import urllib.request

fp = urllib.request.urlopen(URL_string_that_I_use)
for Line in fp:
    print(Line.decode("utf8").rstrip('\n'))
fp.close()

I get a fair amount of the web page's content, but then the rest of the capture is cut off by ...

Traceback (most recent call last):
  File "TATest.py", line 9, in <module>
    for Line in fp:
  File "d:\python32\lib\http\client.py", line 489, in read
    return self._read_chunked(amt)
  File "d:\python32\lib\http\client.py", line 545, in _read_chunked
    self._safe_read(2)  # toss the CRLF at the end of the chunk
  File "d:\python32\lib\http\client.py", line 592, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(0 bytes read, 2 more expected)

Attempting to read another page yields ...

Traceback (most recent call last):
  File "TATest.py", line 11, in <module>
    print(Line.decode("utf8").rstrip('\n'))
  File "d:\python32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 21: character maps to <undefined>

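I do believe this is a Windows problem, but could Python be made more robust in dealing with whatever causes it? We did not run into the problem when trying similar code (the 2.6 version) on Linux. Is there a workaround? I have also posted to the gmane.comp.python.devel newsgroup.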

1 Answer

It looks like the page you are reading is encoded as cp1252.

import urllib.request

fp = urllib.request.urlopen(URL_string_that_I_use)

string = fp.read()
fp.close()
print(string.decode("cp1252"))

should work.
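
One caveat worth hedging against: the last traceback in the question is raised inside encode in cp1252.py, that is, while print is encoding the already-decoded text for the Windows console, not while decoding the page. A minimal sketch of a tolerant output step follows. The helper name console_safe and its encoding parameter are made up for illustration, and it assumes that showing unencodable characters as backslash escapes is acceptable.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the console cannot encode escaped."""
    # Use the console's own encoding unless one is passed in explicitly;
    # fall back to UTF-8 if stdout does not report an encoding.
    enc = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # Round-trip through that encoding: anything it cannot represent
    # becomes a backslash escape instead of raising UnicodeEncodeError.
    return text.encode(enc, errors="backslashreplace").decode(enc)


# '\x92' is the character reported in the traceback; cp1252 cannot encode it.
print(console_safe("it\x92s here", encoding="cp1252"))
```

With this, print(console_safe(string)) should no longer raise on characters the console code page lacks, at the cost of showing escapes in their place.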

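There are many ways to specify the character set of the content, but for most pages reading it from the HTTP header is enough (note that get_content_charset() returns None when the header declares no charset):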

import urllib.request

fp = urllib.request.urlopen(URL_string_that_I_use)

string = fp.read().decode(fp.info().get_content_charset())
fp.close()
print(string)
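
A slightly more defensive variant of the snippet above, as a sketch only. The helper names read_all and decode_body are hypothetical, and the cp1252 fallback is an assumption taken from the guess earlier in this answer. It keeps whatever bytes http.client attaches to IncompleteRead, and it falls back when the header declares no charset or an unknown one. Whether exc.partial actually holds the body received so far depends on the Python version; in the tracebacks in the question it reports 0 bytes, so this may not recover anything on 3.2.

```python
import http.client


def read_all(fp):
    """Read the whole body; keep what arrived if the transfer is cut short."""
    try:
        return fp.read()
    except http.client.IncompleteRead as exc:
        # exc.partial holds the bytes attached to the exception, which may
        # be empty depending on where the read failed.
        return exc.partial


def decode_body(raw, declared_charset, fallback="cp1252"):
    """Decode raw bytes with the declared charset, else with a fallback."""
    # declared_charset is None when the Content-Type header names no
    # charset; an unrecognised charset name raises LookupError.
    try:
        return raw.decode(declared_charset or fallback, errors="replace")
    except LookupError:
        return raw.decode(fallback, errors="replace")


# Intended use with the response object from the snippet above:
#   string = decode_body(read_all(fp), fp.info().get_content_charset())
```

Passing errors="replace" trades exactness for robustness: undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.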
answered 2014-06-30T10:56:25.877