I currently have a script that downloads the top headline from Reddit's front page, and it almost always works. Occasionally, though, I get the exception below. I know I should add try/except statements to protect my code, but where should I put them?

The crawler:

import praw

def crawlReddit():
    r = praw.Reddit(user_agent='challenge')             # PRAW object
    topHeadlines = []                                   # List of headlines
    for item in r.get_front_page():
        topHeadlines.append(item)                       # Add headlines to list
    return topHeadlines[0].title                        # Return top headline

def main():
    headline = crawlReddit()                            # Pull top headline

if __name__ == "__main__":
    main()

The error:

Traceback (most recent call last):
  File "makecall.py", line 57, in <module>
    main()                                      # Run
  File "makecall.py", line 53, in main
    headline = crawlReddit()                            # Pull top headline
  File "makecall.py", line 34, in crawlReddit
    for item in r.get_front_page():
  File "/Users/myusername/Documents/dir/lib/python2.7/site-packages/praw/__init__.py", line 480, in get_content
    page_data = self.request_json(url, params=params)
  File "/Users/myusername/Documents/dir/lib/python2.7/site-packages/praw/decorators.py", line 161, in wrapped
    return_value = function(reddit_session, *args, **kwargs)
  File "/Users/myusername/Documents/dir/lib/python2.7/site-packages/praw/__init__.py", line 519, in request_json
    response = self._request(url, params, data)
  File "/Users/myusername/Documents/dir/lib/python2.7/site-packages/praw/__init__.py", line 383, in _request
    _raise_response_exceptions(response)
  File "/Users/myusername/Documents/dir/lib/python2.7/site-packages/praw/internal.py", line 172, in _raise_response_exceptions
    response.raise_for_status()
  File "/Users/myusername/Documents/dir/lib/python2.7/site-packages/requests/models.py", line 831, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable

1 Answer

It looks like r.get_front_page() returns a lazily evaluated object, and you only need its first element. If so, try the following:

import time

import praw
from requests.exceptions import HTTPError  # The exception raised in your traceback

def crawlReddit():
    r = praw.Reddit(user_agent='challenge')             # PRAW object
    front_page = r.get_front_page()                     # Lazy generator; no request yet
    try:
        first_headline = front_page.next()  # Fetch the first item; the HTTP request happens here
    except HTTPError:
        return None                         # Signal failure so main() can retry
    else:
        return first_headline.title


def main():
    max_attempts = 3
    attempts = 1
    headline = crawlReddit()
    while not headline and attempts < max_attempts:
        time.sleep(1)  # Make the program wait a bit before resending request
        headline = crawlReddit()
        attempts += 1
    if not headline:
        print "Request failed after {} attempts".format(max_attempts)


if __name__ == "__main__":
    main()

Edit: The code now makes up to three attempts to fetch the data, waiting one second between failed attempts. After the third attempt it gives up, since the server may be offline, etc.
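
If the 503s tend to cluster, you could also grow the pause between attempts instead of sleeping a fixed second. This is only a sketch of that variant, not part of the answer above; crawl_with_backoff is a hypothetical helper built on the crawlReddit defined earlier:

import time

def crawl_with_backoff(max_attempts=3, base_delay=1.0):
    # Retry crawlReddit() with exponentially growing pauses: 1s, 2s, 4s, ...
    for attempt in range(max_attempts):
        headline = crawlReddit()
        if headline is not None:
            return headline                              # Success: stop retrying
        if attempt < max_attempts - 1:
            time.sleep(base_delay * (2 ** attempt))      # Back off before the next attempt
    return None                                          # All attempts failed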

answered 2015-01-20T04:30:17.380