python - Speeding up HTTP requests in Python, and 500 errors

Question

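I have some code that retrieves news results from this newspaper using a query and a time frame (which can be up to a year).

The results are paginated at up to 10 articles per page, and since I couldn't find a way to increase that, I issue a request for each page and then retrieve the title, URL, and date of each article. Each cycle (HTTP request plus parsing) takes 30 seconds to a minute, which is extremely slow. Eventually it stops with a response code of 500. I'd like to know whether there is a way to speed this up, or to make several requests at once. I just want to retrieve the article details from all the pages. Here is the code: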

    import requests
    import re
    from bs4 import BeautifulSoup
    import csv

    URL = 'http://www.gulf-times.com/AdvanceSearchNews.aspx?Pageindex={index}&keywordtitle={query}&keywordbrief={query}&keywordbody={query}&category=&timeframe=&datefrom={datefrom}&dateTo={dateto}&isTimeFrame=0'


    def run(**params):
        # Append one CSV row (date, title, url) per article found.
        with open("EgyptDaybyDay.csv", "a", newline="", encoding="utf-8") as countryFile:
            w = csv.writer(countryFile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
            i = 1
            results = True
            while results:
                params["index"] = str(i)
                response = requests.get(URL.format(**params))
                print(response.status_code)
                htmlFile = BeautifulSoup(response.content, "html.parser")
                articles = htmlFile.find_all("div", {"class": "newslist"})

                for article in articles:
                    url = article.a['href']
                    title = article.img['alt']
                    dateline = article.find("div", {"class": "floatright"})
                    m = re.search(r"([0-9]{2}-[0-9]{2}-[0-9]{4})", dateline.string)
                    date = m.group(1)
                    w.writerow((date, title, url))

                # An empty page means we have gone past the last page of results.
                if not articles:
                    results = False
                i += 1


    run(query="Egypt", datefrom="12-01-2010", dateto="12-01-2011")

score 1 · Accepted Answer

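The server is most likely what is slowing things down, so parallelizing the HTTP requests is the best way to make the code run faster, although there is very little you can do to speed up the server's responses themselves. IBM has a good tutorial on how to do this.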
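As a rough illustration of parallelizing the page requests, here is a minimal sketch using the standard library's `concurrent.futures` thread pool. This is a swapped-in choice, not necessarily what the IBM tutorial uses. The names `fetch_pages` and `fake_fetch` are hypothetical; `fake_fetch` stands in for the real `requests.get(URL.format(**params))` call so the example runs without touching the network.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_pages(fetch_page, indexes, max_workers=5):
    """Call fetch_page(index) for every index using a pool of threads.

    Results come back in the same order as `indexes`, regardless of
    which request finishes first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_page, indexes))


def fake_fetch(index):
    # Hypothetical stand-in for requests.get(URL.format(index=index, ...)).
    return "page %d" % index


if __name__ == "__main__":
    for body in fetch_pages(fake_fetch, range(1, 6)):
        print(body)
```

Since the total number of pages is not known in advance, one option is to fetch the pages in batches (1 to 5, then 6 to 10, and so on) and stop once a batch contains a page with no articles. Keeping `max_workers` small is probably wise here, given that the server is already struggling.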

score 1 · Accepted Answer

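This is a good opportunity to try gevent.

You should put the requests.get part in a separate routine, so that your application does not have to wait on blocking IO.

You can then spawn multiple workers and use queues to pass requests and articles around. Maybe something similar to this: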

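    import gevent.monkey
    from gevent.queue import Queue
    from gevent import sleep
    gevent.monkey.patch_all()  # make blocking stdlib IO cooperative

    MAX_REQUESTS = 10

    requests = Queue(MAX_REQUESTS)  # page numbers waiting to be fetched (not the requests library)
    articles = Queue()              # results handed back to the main greenlet

    # Stand-in for real HTTP responses; pop() yields 0, 1, 2, ...
    mock_responses = list(range(100))
    mock_responses.reverse()

    def request():
        print("worker started")
        while True:
            print("request %s" % requests.get())
            sleep(1)  # simulates the slow HTTP round trip

            try:
                articles.put('article response %s' % mock_responses.pop())
            except IndexError:
                # No responses left: tell the consumer loop to stop iterating.
                articles.put(StopIteration)
                break

    def run():
        print("run")

        i = 1
        while True:
            requests.put(i)  # blocks while the queue already holds MAX_REQUESTS items
            i += 1

    if __name__ == '__main__':
        for worker in range(MAX_REQUESTS):
            gevent.spawn(request)

        gevent.spawn(run)
        for article in articles:
            print("Got article: %s" % article)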

score 0 · Accepted Answer

This might very well come close to what you're looking for.

Ideal method for sending multiple HTTP requests over Python? [duplicate]

Source code: https://github.com/kennethreitz/grequests

score 0 · Accepted Answer

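You can try making all the calls asynchronously.

Take a look at this: http://pythonquirks.blogspot.in/2011/04/twisted-asynchronous-http-request.html

You could also use gevent instead of Twisted; I am just pointing out the options.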
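For comparison, the same asynchronous idea can be sketched with the standard library's `asyncio`, which is neither Twisted nor gevent but follows the same pattern of issuing many calls without blocking on each one. The names `fetch_all` and `fake_fetch` are hypothetical; `fake_fetch` merely simulates a non-blocking HTTP call, and a real version would need an async HTTP client.

```python
import asyncio


async def fetch_all(fetch_page, indexes, limit=5):
    """Run fetch_page(index) for every index, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(index):
        async with semaphore:
            return await fetch_page(index)

    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(bounded(i) for i in indexes))


async def fake_fetch(index):
    # Hypothetical stand-in for a real non-blocking HTTP call.
    await asyncio.sleep(0.01)
    return "page %d" % index


if __name__ == "__main__":
    print(asyncio.run(fetch_all(fake_fetch, range(1, 6))))
```

The semaphore caps how many requests are in flight at once, which seems prudent when the server is already slow to respond.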

score 0 · Accepted Answer

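It sounds to me like you are looking for a feed that this newspaper does not advertise. However, this is a problem that has been solved before: there are many sites that will generate feeds for arbitrary websites, which would solve at least one of your problems. Some of them need a bit of human guidance, while others offer fewer opportunities for tweaking and are more automated.

If you can avoid doing the pagination and parsing yourself at all, I would recommend it. If you can't, I second the use of gevent for simplicity. That said, if they are sending you back 500s, your code is probably not the main problem, and adding parallelism may not help.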
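If the 500s really do come from an overloaded server, one common mitigation is to retry with an increasing delay rather than send more requests at once. The following is only a sketch under that assumption: `get_with_retry` is a hypothetical helper, and `fetch` is assumed to be any zero-argument callable returning a `(status_code, body)` pair, for example a small wrapper around `requests.get`.

```python
import time


def get_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-5xx status or retries run out.

    `fetch` is any zero-argument callable returning (status_code, body).
    Returns the last (status_code, body) pair seen.
    """
    status, body = fetch()
    for attempt in range(retries):
        if status < 500:
            break
        # Wait 1x, 2x, 4x ... base_delay to give the server room to recover.
        sleep(base_delay * 2 ** attempt)
        status, body = fetch()
    return status, body


if __name__ == "__main__":
    # Simulated server that fails once and then succeeds.
    replies = iter([(500, ""), (200, "<html>...</html>")])
    print(get_with_retry(lambda: next(replies), base_delay=0.1))
```

Whether retrying helps depends on why the server fails; if it rejects every request past a certain page, no amount of waiting will fix that.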
