
I would like to keep a Scrapy crawler constantly running inside a Celery task worker, probably using something like this, or as suggested in the docs. The idea would be to use the crawler to query an external API that returns XML responses. I would like to pass the URL (or the query parameters, and let the crawler build the URL) I want to query to the crawler, and the crawler would make the call and give me back the extracted items. How can I pass a new URL to the crawler once it has started running? I do not want to restart the crawler every time I want to give it a new URL; instead, I want the crawler to sit idle, waiting for URLs to crawl.

The two methods I've spotted for running Scrapy inside another Python process both spawn a new process to run the crawler in. I would rather not fork and tear down a new process every time I want to crawl a URL, since that is fairly expensive and unnecessary.


2 Answers


Just have a spider that polls a database (or a file?) and, when a new URL appears, creates and yields a new Request() object for it.

You can build this by hand easily enough. There is probably a better way to do it, but that's basically what I did for an open-proxy scraper: the spider grabs a list of all the "potential" proxies from the database and generates a Request() object for each one. When they come back, they are dispatched down the chain and verified by downstream middleware, and their records are updated by the item pipeline.
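
For concreteness, here is a minimal sketch of that pattern with current Scrapy (the API has changed since this answer was written): a spider that keeps itself alive via the spider_idle signal and, each time it goes idle, polls a hypothetical SQLite table of pending URLs and schedules a Request() for each new one. The database file, table schema, and spider name are all made up for illustration.

import sqlite3

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class PollingSpider(scrapy.Spider):
    name = "polling"  # hypothetical spider name

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Fire our handler whenever the spider runs out of requests.
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def start_requests(self):
        yield from self.poll_new_urls()

    def on_idle(self):
        # Poll for new work and feed it straight into the running engine.
        for request in self.poll_new_urls():
            # Scrapy >= 2.10 signature; older versions also take the spider
            # as a second argument.
            self.crawler.engine.crawl(request)
        # Raising DontCloseSpider keeps the spider from shutting down when the
        # queue is empty, so it sits idle waiting for more URLs.
        raise DontCloseSpider

    def poll_new_urls(self):
        # Hypothetical schema: urls(id INTEGER PRIMARY KEY, url TEXT, done INTEGER)
        conn = sqlite3.connect("urls.db")
        rows = conn.execute("SELECT id, url FROM urls WHERE done = 0").fetchall()
        conn.executemany("UPDATE urls SET done = 1 WHERE id = ?",
                         [(row_id,) for row_id, _ in rows])
        conn.commit()
        conn.close()
        for _, url in rows:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Parse the XML response and yield extracted items here.
        self.logger.info("fetched %s", response.url)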

Answered on 2013-05-23T02:53:46.593

You can use a message queue (such as IronMQ; full disclosure, I work for the company that makes IronMQ, as a developer evangelist) to pass the URLs.

Then in your crawler, poll the queue for URLs and crawl based on the messages you retrieve.

The example you linked to could be updated as follows (this is untested pseudocode, but you should get the basic idea):

import time

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider
from iron_mq import IronMQ

mq = IronMQ()
q = mq.queue("scrape_queue")
crawler = Crawler(Settings())
crawler.configure()
while True:  # poll forever
    # Get a message from the queue. The timeout is the number of seconds the
    # message stays reserved, making sure no other crawler gets that message.
    # Set it to a safe value (the maximum time it could take to crawl a page).
    msg = q.get(timeout=120)
    if len(msg["messages"]) < 1:  # no URLs waiting to be crawled
        time.sleep(1)             # wait one second
        continue                  # then try again
    spider = FollowAllSpider(domain=msg["messages"][0]["body"])  # crawl the domain in the message
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # the script will block here (note: the Twisted reactor
                   # cannot be restarted, so as written the loop only makes one
                   # pass; this is pseudocode for the overall flow)
    q.delete(msg["messages"][0]["id"])  # when you're done with the message, delete it
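
On the producer side (e.g. your Celery task, or whatever decides which URLs to query), something along these lines would push a URL onto that queue. This assumes the iron_mq Python client exposes a post() method on the queue object, which may differ between client versions, and the URL is just a placeholder:

from iron_mq import IronMQ

mq = IronMQ()
q = mq.queue("scrape_queue")  # same queue name the crawler loop polls
q.post("http://www.example.com/api?query=foo")  # enqueue a URL for the crawler
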
Answered on 2013-05-24T13:40:38.737