
I would like to keep a Scrapy crawler constantly running inside a Celery task worker, probably using something like this, or as suggested in the docs. The idea would be to use the crawler to query an external API that returns XML responses. I would like to pass the URL (or the query parameters, and let the crawler build the URL) I want to query to the crawler, and the crawler would make the call and give me back the extracted items. How can I pass a new URL to the crawler once it has started running? I do not want to restart the crawler every time I want to give it a new URL; instead, I want the crawler to sit idle, waiting for URLs to crawl.

The two methods I've spotted for running Scrapy inside another Python process both spawn a new process to run the crawler in. I would rather not fork and tear down a new process every time I want to crawl a URL, since that is fairly expensive and unnecessary.


2 Answers


Just have a spider that polls a database (or a file?) and, when a new URL appears, creates and yields a new Request() object for it.

You can build this by hand easily enough. There is probably a better way to do it, but that's basically what I did for an open-proxy scraper: the spider grabs a list of all the "potential" proxies from the database and generates a Request() object for each one. When they come back, they are dispatched down the chain and verified by downstream middleware, and their records are updated by the item pipeline.
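
For concreteness, here is a minimal sketch of that pattern with current Scrapy (the API has changed since this answer was written): a spider that keeps itself alive via the spider_idle signal and, each time it goes idle, polls a hypothetical SQLite table of pending URLs and schedules a Request() for each new one. The database file, table schema, and spider name are all made up for illustration.

import sqlite3

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class PollingSpider(scrapy.Spider):
    name = "polling"  # hypothetical spider name

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Fire our handler whenever the spider runs out of requests.
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def start_requests(self):
        yield from self.poll_new_urls()

    def on_idle(self):
        # Poll for new work and feed it straight into the running engine.
        for request in self.poll_new_urls():
            # Scrapy >= 2.10 signature; older versions also take the spider
            # as a second argument.
            self.crawler.engine.crawl(request)
        # Raising DontCloseSpider keeps the spider from shutting down when the
        # queue is empty, so it sits idle waiting for more URLs.
        raise DontCloseSpider

    def poll_new_urls(self):
        # Hypothetical schema: urls(id INTEGER PRIMARY KEY, url TEXT, done INTEGER)
        conn = sqlite3.connect("urls.db")
        rows = conn.execute("SELECT id, url FROM urls WHERE done = 0").fetchall()
        conn.executemany("UPDATE urls SET done = 1 WHERE id = ?",
                         [(row_id,) for row_id, _ in rows])
        conn.commit()
        conn.close()
        for _, url in rows:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Parse the XML response and yield extracted items here.
        self.logger.info("fetched %s", response.url)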

Answered on 2013-05-23T02:53:46.593

You can use a message queue (such as IronMQ; full disclosure, I work for the company that makes IronMQ, as a developer evangelist) to pass the URLs.

Then in your crawler, poll the queue for URLs and crawl based on the messages you retrieve.

The example you linked to could be updated as follows (this is untested pseudocode, but you should get the basic idea):

import time

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider
from iron_mq import IronMQ

mq = IronMQ()
q = mq.queue("scrape_queue")
crawler = Crawler(Settings())
crawler.configure()
while True:  # poll forever
    # Get a message from the queue. The timeout is the number of seconds the
    # message stays reserved, making sure no other crawler gets that message.
    # Set it to a safe value (the maximum time it could take to crawl a page).
    msg = q.get(timeout=120)
    if len(msg["messages"]) < 1:  # no URLs waiting to be crawled
        time.sleep(1)             # wait one second
        continue                  # then try again
    spider = FollowAllSpider(domain=msg["messages"][0]["body"])  # crawl the domain in the message
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # the script will block here (note: the Twisted reactor
                   # cannot be restarted, so as written the loop only makes one
                   # pass; this is pseudocode for the overall flow)
    q.delete(msg["messages"][0]["id"])  # when you're done with the message, delete it
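
On the producer side (e.g. your Celery task, or whatever decides which URLs to query), something along these lines would push a URL onto that queue. This assumes the iron_mq Python client exposes a post() method on the queue object, which may differ between client versions, and the URL is just a placeholder:

from iron_mq import IronMQ

mq = IronMQ()
q = mq.queue("scrape_queue")  # same queue name the crawler loop polls
q.post("http://www.example.com/api?query=foo")  # enqueue a URL for the crawler
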
Answered on 2013-05-24T13:40:38.737