I want to scrape a website whose URL contains German umlauts. Here is my code in Python 3.3; it works fine as long as the keyword contains no umlauts.
def numResults(keyword):
    try:
        page_google = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=' + keyword
        print(page_google)
        req_google = Request(page_google)
        req_google.add_header('User Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
        html_google = urlopen(req_google).read()
        soup = BeautifulSoup(html_google)
    except URLError as e:
        print(e)
    return soup
But when I call it with something like:
print(numResults('älterer'))
I get the following error, I guess because urllib cannot handle the umlauts:
Traceback (most recent call last):
  File "C:\Users\zwieback86\Desktop\programming\scrape.py", line 137, in <module>
    print(numResults('älterer'))
  File "C:\Users\zwieback86\Desktop\programming\scrape.py", line 73, in numResults
    html_google = urlopen(req_google).read()
  File "c:\python33\lib\urllib\request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "c:\python33\lib\urllib\request.py", line 469, in open
    response = self._open(req, data)
  File "c:\python33\lib\urllib\request.py", line 487, in _open
    '_open', req)
  File "c:\python33\lib\urllib\request.py", line 447, in _call_chain
    result = func(*args)
  File "c:\python33\lib\urllib\request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "c:\python33\lib\urllib\request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "c:\python33\lib\http\client.py", line 1061, in request
    self._send_request(method, url, body, headers)
  File "c:\python33\lib\http\client.py", line 1089, in _send_request
    self.putrequest(method, url, **skips)
  File "c:\python33\lib\http\client.py", line 953, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 38: ordinal not in range(128)
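As far as I can tell, "position 38" fits the idea that the whole HTTP request line is being encoded as ASCII. Below is a small sketch that reproduces the failure without any network access. The exact shape of the request line is my assumption from the traceback (I did not copy it from the library source), and `first_non_ascii` is just a hypothetical helper name.

```python
# Assumed shape of the request line that http.client tries to send;
# reconstructed from the traceback, not taken from the library source.
selector = '/ajax/services/search/web?v=1.0&q=' + '\u00e4lterer'  # 'älterer'
request_line = 'GET %s HTTP/1.1' % selector


def first_non_ascii(text):
    """Return the index of the first character ASCII cannot encode, or None."""
    try:
        text.encode('ascii')
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first offending character.
        return exc.start
    return None


print(first_non_ascii(request_line))  # 38
```

Index 38 of that string is the 'ä', which matches the position reported in the error.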
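When I enter the address "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=älterer" in my browser, I get the page I want.
So I assume that urllib cannot handle requests whose URL contains umlauts. How can I make it accept German umlauts? Transliterating them, e.g. ä -> ae, is not an option.
Thanks a lot and best regards!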
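One idea I have not verified against the service: browsers normally percent-encode non-ASCII characters before sending a request, so perhaps the keyword has to be encoded the same way. Here is a minimal sketch using `urllib.parse.urlencode`, which percent-encodes strings as UTF-8 by default. `BASE_URL` and `build_search_url` are hypothetical names, and that the server expects UTF-8 is an assumption on my part.

```python
from urllib.parse import urlencode

# Hypothetical constant: the endpoint from the code above, without the query.
BASE_URL = 'http://ajax.googleapis.com/ajax/services/search/web'


def build_search_url(keyword):
    """Return an ASCII-only URL with the keyword percent-encoded as UTF-8."""
    # A list of pairs keeps the parameter order stable on every Python 3 version.
    query = urlencode([('v', '1.0'), ('q', keyword)])
    return BASE_URL + '?' + query


url = build_search_url('\u00e4lterer')  # 'älterer'
print(url)
# http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=%C3%A4lterer
```

The resulting string contains only ASCII characters, so the `request.encode('ascii')` step from the traceback should no longer fail on it. The keyword itself stays 'älterer'; only its representation in the URL changes.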