python - Beautiful Soup: getting a warning, then an error partway through my code

Question

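I am looping through the Wikipedia page for each date of the year (January 1, January 2, ..., December 31). On each page, I collect the names of the people whose birthday falls on that day. However, partway through my run (at April 27), I get the following warning: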

WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

Immediately afterwards, I get an error:

Traceback (most recent call last):
    File "wikipedia.py", line 29, in <module>
        section = soup.find('span', id='Births').parent
AttributeError: 'NoneType' object has no attribute 'parent'

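Basically, I can't figure out why it decides to throw this warning and error after getting all the way to April 27. Here is the April 27 page:

April 27...

As far as I can tell, nothing is different there that would cause this. There is still a span with id="Births".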

Here is the code where I call all of this:

    site = "http://en.wikipedia.org/wiki/"+a+"_"+str(b)
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site,headers=hdr)    
    page = urllib2.urlopen(req)
    soup = BeautifulSoup(page)

    section = soup.find('span', id='Births').parent
    births = section.find_next('ul').find_all('li')

    for x in births:
        #All the regex and parsing, don't think it's necessary to show

The error is raised on this line:

section = soup.find('span', id='Births').parent

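By the time I reach April 27 I am holding a lot of data (8 lists of roughly 35,000 elements each), but I don't think that is the problem. If anyone has any ideas, I would appreciate it. Thanks.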

score 4 · Accepted Answer

It looks like the Wikipedia server is serving the page compressed:

>>> page.info().get('Content-Encoding')
'gzip'

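It shouldn't do that when your request carries no Accept-Encoding header, but that's life when working with other people's servers. Beautiful Soup is therefore handed the raw gzip bytes instead of HTML, which explains both the decoding warning and why soup.find('span', id='Births') returns None.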

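There are plenty of resources showing how to handle compressed data. Here is one: http://www.diveintopython.net/http_web_services/gzip_compression.html

And another: Does python urllib2 automatically uncompress gzip data fetched from webpage?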
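
As a rough illustration of the approach those resources describe, here is a minimal sketch that gunzips the body only when the Content-Encoding header says it is gzip. The helper name decode_body is hypothetical (it is not part of urllib2 or Beautiful Soup), and the sketch assumes the whole response body has already been read into memory as bytes.

```python
import gzip
import io


def decode_body(raw_bytes, content_encoding):
    """Return the response body, gunzipping it if the server marked it as gzip.

    decode_body is a hypothetical helper name used only for this sketch.
    raw_bytes is the full body as bytes; content_encoding is the value of
    the Content-Encoding response header (or None if the header is absent).
    """
    if content_encoding and content_encoding.strip().lower() == 'gzip':
        # Wrap the bytes in a file-like object so GzipFile can read them.
        return gzip.GzipFile(fileobj=io.BytesIO(raw_bytes)).read()
    # No gzip encoding announced: pass the body through unchanged.
    return raw_bytes
```

With the code from the question, this could be used roughly as follows before building the soup:

    html = decode_body(page.read(), page.info().get('Content-Encoding'))
    soup = BeautifulSoup(html)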
