python - UnicodeDecodeError Python error

Question

I'm trying to write a Python client for the Google search API and I'm hitting some Unicode issues. My really basic PoC so far is:

#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup        
query = "filetype%3Apdf"
url = "http://www.google.com/search?sclient=psy-ab&hl=en&site=&source=hp&q="+query+"&btnG=Search"
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open(url)
data = response.read()
data = data.decode('UTF-8', 'ignore')
data = data.encode('UTF-8', 'ignore')
soup = BeautifulSoup(data)
print u""+soup.prettify('UTF-8')

My traceback is:

Traceback (most recent call last):
  File "./google.py", line 22, in <module>
    print u""+soup.prettify('UTF-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 48786: ordinal not in range(128)

Any ideas?

score 4 · Accepted Answer

You are converting the soup tree to UTF-8 (an encoded byte string) and then trying to concatenate that to an empty u'' unicode string.

Python 2 will automatically try to decode your encoded byte string using the default encoding (which is ASCII), and that fails on the UTF-8 data.
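To make the implicit step visible, here is a minimal sketch that performs the same ASCII decode by hand. The sample text is an assumption chosen for illustration: it contains a non-breaking space (U+00A0), whose UTF-8 encoding starts with the byte 0xc2, the same byte reported in the traceback. The page in the question may contain other characters.

```python
# Assumed sample text; U+00A0 (non-breaking space) encodes to 0xc2 0xa0 in UTF-8.
text = u"a\u00a0b"
encoded = text.encode("utf-8")  # a byte string, like soup.prettify('UTF-8')

# Python 2 does the equivalent of this decode implicitly for u"" + encoded.
try:
    encoded.decode("ascii")
    implicit_decode_failed = False
    error_message = ""
except UnicodeDecodeError as exc:
    implicit_decode_failed = True
    error_message = str(exc)

# Decoding with the codec that produced the bytes round-trips cleanly.
decoded = encoded.decode("utf-8")
```

Written this way the snippet runs on both Python 2 and Python 3, because the decode is explicit rather than triggered by the concatenation.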

You need to explicitly decode the prettify() output:

print u"" + soup.prettify('UTF-8').decode('UTF-8')
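If the same fix is needed in more than one place, it could be wrapped in a small helper. This is only a sketch: `ensure_text` is a hypothetical name, not part of Beautiful Soup or the standard library, and the UTF-8 default is an assumption that matches the encoding used in the question.

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as a unicode text string.

    Byte strings are decoded explicitly with the given encoding;
    values that are already text are returned unchanged.
    """
    if isinstance(value, bytes):  # on Python 2, bytes is an alias for str
        return value.decode(encoding)
    return value


# Stand-in for soup.prettify('UTF-8'): UTF-8 bytes with a non-ASCII character.
raw = u"a\u00a0b".encode("utf-8")

# Concatenation is safe because both operands are unicode text.
combined = u"" + ensure_text(raw)
```

In the question's script, the call would wrap `soup.prettify('UTF-8')` before the result is joined with any unicode string.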

The Python Unicode HOWTO explains this better, including the default encoding. I really, really recommend you also read Joel Spolsky's article on Unicode.
