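Previously, in Python 2.6, I made heavy use of urllib.urlopen to capture web page content and then post-process the data I received. Now those routines, and the new ones I am trying to write for Python 3.2, are running into what appears to be a Windows-only problem (possibly even a Windows 7-only problem).
Using the following code on Windows 7 with Python 3.2.2 (64-bit)...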
import urllib.request
fp = urllib.request.urlopen(URL_string_that_I_use)
string = fp.read()
fp.close()
print(string.decode("utf8"))
I get the following message:
Traceback (most recent call last):
  File "TATest.py", line 5, in <module>
    string = fp.read()
  File "d:\python32\lib\http\client.py", line 489, in read
    return self._read_chunked(amt)
  File "d:\python32\lib\http\client.py", line 553, in _read_chunked
    self._safe_read(2) # toss the CRLF at the end of the chunk
  File "d:\python32\lib\http\client.py", line 592, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(0 bytes read, 2 more expected)
Using the following code instead...
import urllib.request
fp = urllib.request.urlopen(URL_string_that_I_use)
for Line in fp:
    print(Line.decode("utf8").rstrip('\n'))
fp.close()
I get a fair amount of the web page content, but the rest of the capture is cut off by...
Traceback (most recent call last):
  File "TATest.py", line 9, in <module>
    for Line in fp:
  File "d:\python32\lib\http\client.py", line 489, in read
    return self._read_chunked(amt)
  File "d:\python32\lib\http\client.py", line 545, in _read_chunked
    self._safe_read(2) # toss the CRLF at the end of the chunk
  File "d:\python32\lib\http\client.py", line 592, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(0 bytes read, 2 more expected)
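One possible stopgap for the IncompleteRead above, offered only as a minimal sketch: read the response in blocks and keep whatever arrived before the stream ended early. The names read_tolerant and block_size are hypothetical, not part of urllib or http.client, and the returned data may still be truncated if the server really did stop short.

```python
import http.client


def read_tolerant(fp, block_size=8192):
    """Read fp to the end, keeping the data received so far if the
    chunked stream ends early. (Hypothetical helper, not a library API.)"""
    chunks = []
    try:
        while True:
            block = fp.read(block_size)
            if not block:
                break  # normal end of the response
            chunks.append(block)
    except http.client.IncompleteRead as err:
        # err.partial holds the bytes of the read that was cut short
        # (possibly empty, as in "0 bytes read" above).
        chunks.append(err.partial)
    return b"".join(chunks)


# Intended use, with the same placeholder URL as in the code above:
#   fp = urllib.request.urlopen(URL_string_that_I_use)
#   data = read_tolerant(fp)
#   fp.close()
```

Accumulating the blocks in the caller, rather than relying only on err.partial, is deliberate: in the traceback above the exception itself reports 0 bytes read, so it would not carry the body that had already been received.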
Trying to read another page yields...
Traceback (most recent call last):
  File "TATest.py", line 11, in <module>
    print(Line.decode("utf8").rstrip('\n'))
  File "d:\python32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 21: character maps to <undefined>
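Note that this last traceback is raised in the encode function of cp1252.py, that is, while print is writing to the console, not while decoding the page. A minimal sketch of one way around it, assuming it is acceptable to show unencodable characters as "?": round-trip the text through the console encoding with errors="replace" before printing. The name console_safe is hypothetical.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with any character the target encoding cannot
    represent replaced by '?'. (Hypothetical helper.)"""
    # Fall back to ASCII if stdout has no known encoding.
    enc = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(enc, errors="replace").decode(enc)


# Intended use inside the loop above:
#   print(console_safe(Line.decode("utf8").rstrip('\n')))
```

This only changes what is displayed; the decoded string itself is left intact for any later processing.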
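I do believe this is a Windows issue, but can Python be made more robust in dealing with whatever is causing it? When we tried similar code on Linux (the 2.6 version of the code), we did not run into the problem. Is there a workaround? I have also posted to the gmane.comp.python.devel newsgroup.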