I was just playing around with urllib2 and a page that contains UTF-8:
http://www.columbia.edu/~fdc/utf8/
Fetching only the first 700 bytes (the top section):
>>> import urllib2
>>> from urllib2 import HTTPError, URLError
>>> import BaseHTTPServer
>>> opener = urllib2.OpenerDirector()
>>> opener.add_handler(urllib2.HTTPHandler())
>>> opener.add_handler(urllib2.HTTPDefaultErrorHandler())
>>> response = opener.open('http://www.columbia.edu/~fdc/utf8/')
>>> content = response.read(700)
At this point I would expect the string in the content variable to be UTF-8 encoded and to display fine.
However:
>>> content
'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n<html>\n<head>\n<BASE href="http://kermit.columbia.edu">\n<META http-equiv="Content-Type" content="text/html; charset=utf-8">\n<title>UTF-8 Sampler</title>\n</head>\n<body bgcolor="#ffffff" text="#000000">\n<h1><tt>UTF-8 SAMPLER</tt></h1>\n\n<big><big> \xc2\xa5 \xc2\xb7 \xc2\xa3 \xc2\xb7 \xe2\x82\xac \xc2\xb7 $ \xc2\xb7 \xc2\xa2 \xc2\xb7 \xe2\x82\xa1 \xc2\xb7 \xe2\x82\xa2 \xc2\xb7 \xe2\x82\xa3 \xc2\xb7 \xe2\x82\xa4 \xc2\xb7 \xe2\x82\xa5 \xc2\xb7 \xe2\x82\xa6 \xc2\xb7 \xe2\x82\xa7 \xc2\xb7 \xe2\x82\xa8 \xc2\xb7 \xe2\x82\xa9 \xc2\xb7 \xe2\x82\xaa \xc2\xb7 \xe2\x82\xab \xc2\xb7 \xe2\x82\xad \xc2\xb7 \xe2\x82\xae \xc2\xb7 \xe2\x82\xaf \xc2\xb7 &#8377;</big></big>\n\n\n\n<p>\n<blockquote>\nFrank da Cruz<br>\n<a hre'
It looks HTML-escaped, so:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape(content)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 390, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
So I don't get it. I even tried .encode('utf-8') together with unescape, but got a similar error.
What is the best way to display UTF-8 content from a website?
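
For reference, here is a minimal sketch of what I suspect is going on, not a confirmed diagnosis. The value returned by read() is a byte string, and the \xc2\xa5 sequences in the output above are just how the interpreter shows UTF-8 bytes. The traceback suggests Python 2 falls back to an implicit ASCII decode when unescape mixes those bytes with unicode replacement text. Decoding to unicode before unescaping should sidestep that. The helper name decode_and_unescape is made up for illustration, and the sample bytes are a fragment of the response above plus an &amp; entity I added so there is something to unescape.

```python
try:
    from html import unescape            # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7, as in the session above
    unescape = HTMLParser().unescape


def decode_and_unescape(raw, encoding='utf-8'):
    """Hypothetical helper: turn raw bytes into text, then resolve entities.

    The encoding is assumed to be the one declared by the page
    (charset=utf-8 in the META tag shown above).
    """
    text = raw.decode(encoding)   # bytes -> unicode text, done explicitly
    return unescape(text)         # entities are resolved on text, not bytes


# Fragment of the response bytes, with an '&amp;' added for illustration.
sample = b'\xc2\xa5 \xc2\xb7 \xc2\xa3 &amp; \xe2\x82\xac'
result = decode_and_unescape(sample)
```

Printing result on a terminal configured for UTF-8 should then show the actual currency symbols instead of the backslash escapes. In the session above, the equivalent would be h.unescape(content.decode('utf-8')), though cutting the read at 700 bytes could in principle split a multi-byte character at the very end.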