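I am crawling a website and parsing some content and images, but even for a simple site of 100 or so pages it takes hours to finish the job. I am using the settings below. Any help would be much appreciated. I have already seen this question, "Scrapy's Scrapyd too slow with scheduling spiders", but could not gather much insight from it.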

EXTENSIONS = {'scrapy.contrib.logstats.LogStats': 1}
LOGSTATS_INTERVAL = 60.0
RETRY_TIMES = 4
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 12
CONCURRENT_ITEMS = 200
DOWNLOAD_DELAY = 0.75

1 Answer

Are you sure the website is responding OK?

Setting DOWNLOAD_DELAY = 0.75 makes Scrapy wait roughly 0.75 seconds between consecutive requests to the same site, so those requests effectively go out one at a time. Your crawl will certainly be faster if you remove it. However, with 12 concurrent requests per domain, be careful not to hit websites too aggressively.
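As a rough sketch of that trade-off, a settings.py fragment might drop the delay while lowering per-domain concurrency to stay polite. The value 4 is an illustrative assumption, not a recommendation from the question or from Scrapy; tune it for the site you are crawling.

```python
# settings.py (sketch): the specific numbers are assumptions for illustration.

# No fixed pause between requests to the same site (0 is Scrapy's default).
DOWNLOAD_DELAY = 0

# Fewer parallel requests per domain than the 12 in the question,
# to avoid hammering a single site once the delay is gone.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Global cap on in-flight requests, unchanged from the question.
CONCURRENT_REQUESTS = 32
```

If the target site starts returning errors or timing out after this change, that is a hint to bring some delay back.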

Even with the delay it should not take hours, which is why I am wondering whether the website is slow or unresponsive. Some websites do this to bots.
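To make that concrete, here is a back-of-the-envelope estimate. It assumes requests to one domain are issued strictly one after another, which is a simplification. The helper name `estimate_crawl_seconds`, the figure of 10 images per page, and the 0.5 second response time are all hypothetical, not taken from the question.

```python
def estimate_crawl_seconds(num_requests, download_delay, avg_response_time=0.0):
    """Rough crawl time if requests go out one at a time.

    Each request is assumed to cost the configured delay plus the
    (hypothetical) average time the server takes to respond.
    """
    return num_requests * (download_delay + avg_response_time)


# 100 pages, delay only: a little over a minute.
pages_only = estimate_crawl_seconds(100, 0.75)

# 100 pages plus an assumed 10 images each (1,000 requests),
# with an assumed 0.5 s response time per request.
with_images = estimate_crawl_seconds(1000, 0.75, avg_response_time=0.5)

print(pages_only, "seconds for pages only")
print(with_images / 60, "minutes with images")
```

Under these assumptions the crawl finishes in minutes rather than hours, so a multi-hour run points at slow responses, retries (RETRY_TIMES = 4 multiplies the cost of every failing request), or throttling by the site rather than at the delay itself.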

Answered 2012-08-14T13:20:38.827