python - Running Scrapy from Python

Question

I am trying to run Scrapy from Python. I am looking at this code (source):

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider

spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here

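My problem is that I am confused about how to adapt this code to run my own spider. I have called my spider project "spider_a", and the domain to crawl is specified in the spider itself.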

What I am asking is: if I run my spider with the following command:

scrapy crawl spider_a

how do I adapt the example Python code above to do the same thing?

score 2 · Accepted Answer

Just import your spider and pass it to crawler.crawl(), for example:

from testspiders.spiders.spider_a import MySpider

spider = MySpider()
crawler.crawl(spider)

score 1 · Accepted Answer

In Scrapy 0.19.x (this may also work in older versions) you can do the following.

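from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings
from testspiders.spiders.followall import FollowAllSpider

spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()  # load the project's settings.py
crawler = Crawler(settings)
# stop the reactor once the spider has closed, so the script can exit
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here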
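
The snippet above targets Scrapy 0.19.x. Later Scrapy releases reworked this API. To my knowledge, from Scrapy 1.0 onward the documented way to run a spider from a script is CrawlerProcess, which starts and stops the Twisted reactor for you. The following is only a minimal sketch under those assumptions: run_spider is a hypothetical helper name, the spider name "spider_a" comes from the question, and the script is assumed to run inside the project directory so that get_project_settings() can find the project settings.

```python
def run_spider(spider_name):
    """Run one spider by name and block until the crawl has finished.

    Hypothetical helper; assumes Scrapy 1.0 or later.
    """
    # Imported lazily so that merely importing this module does not
    # require Scrapy to be installed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Use the project's own settings, as `scrapy crawl` would.
    process = CrawlerProcess(get_project_settings())
    # With project settings loaded, a spider can be referenced by name.
    process.crawl(spider_name)
    # Starts the reactor and blocks until all scheduled crawls are done.
    process.start()
```

Calling run_spider("spider_a") from a script in the project directory should then behave much like scrapy crawl spider_a.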

You can even call the command directly from a script, like this:

from scrapy import cmdline
cmdline.execute("scrapy crawl followall".split())  #followall is the spider's name
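
Splitting a command string works for simple cases, but it breaks as soon as an argument contains a space. One possible alternative is to build the argument list explicitly. The sketch below is only an illustration: build_crawl_argv and run_crawl are hypothetical helper names, not part of Scrapy, while -a NAME=VALUE (spider arguments) and -o FILE (output file) are options of the scrapy crawl command. Note also that cmdline.execute() normally finishes by calling sys.exit(), so code placed after it will not run.

```python
def build_crawl_argv(spider_name, spider_args=None, output=None):
    """Build the argv list for a `scrapy crawl` invocation.

    Hypothetical helper, not part of Scrapy itself.
    """
    argv = ["scrapy", "crawl", spider_name]
    # Each spider argument becomes a `-a NAME=VALUE` pair.
    for key, value in (spider_args or {}).items():
        argv += ["-a", f"{key}={value}"]
    # `-o FILE` asks Scrapy to write the scraped items to a file.
    if output:
        argv += ["-o", output]
    return argv


def run_crawl(argv):
    """Hand the argv list to Scrapy's command-line entry point."""
    # Imported lazily; requires Scrapy to be installed.
    from scrapy import cmdline
    # Normally ends the process via sys.exit() when the crawl is done.
    cmdline.execute(argv)
```

For the spider in the question, run_crawl(build_crawl_argv("spider_a")) passes the same arguments as scrapy crawl spider_a.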

Take a look at my answer here. I changed the official documentation, so your crawler now uses your project settings and can produce output.
