python - Calling scrapy from a Python script does not create the JSON output file

Question

This is the Python script I use to call scrapy, taken from an answer:

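from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher
# MySpider must also be imported from your project's spiders module

def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider(start_url='abc')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg('Running reactor...')
reactor.run()  # the script will block here until the spider is closed
log.msg('Reactor stopped.')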

Here is my pipelines.py code:

from scrapy import log, signals
from scrapy.contrib.exporter import JsonItemExporter
from scrapy.xlib.pydispatch import dispatcher


class scrapermar11Pipeline(object):

    def __init__(self):
        self.files = {}
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_opened(self, spider):
        file = open('links_pipelines.json', 'wb')
        self.files[spider] = file
        self.exporter = JsonItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        log.msg('It reached here')
        return item

This code is taken from here:

Scrapy :: Issues with JSON export

When I run the crawler like this

scrapy crawl MySpider -a start_url='abc'

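the links file is created with the expected output. But when I execute the Python script, no file is created, even though the crawler does run: the dumped scrapy stats are similar to those of the previous run. I think the mistake is in the Python script, since the file is created with the first approach. How do I get the script to output the file?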

score 1 · Accepted Answer

This code worked for me:

from twisted.internet import reactor  # needed by handleSpiderIdle below
from scrapy import signals, log
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings  # the project's settings, not a bare Settings()
from scrapy.http import Request
from multiprocessing.queues import Queue
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process
# import your spider here

def handleSpiderIdle(spider):
    reactor.stop()

# enable the pipeline explicitly so that the JSON exporter runs
mySettings = {'LOG_ENABLED': True, 'ITEM_PIPELINES': '<name of your project>.pipelines.scrapermar11Pipeline'}

settings.overrides.update(mySettings)

crawlerProcess = CrawlerProcess(settings)
crawlerProcess.install()
crawlerProcess.configure()

spider = <nameofyourspider>(domain="") # create a spider ourselves
crawlerProcess.crawl(spider) # add it to spiders pool

dispatcher.connect(handleSpiderIdle, signals.spider_idle) # use this if you need to handle idle event (restart spider?)

log.start() # depends on LOG_ENABLED
print("Starting crawler.")
crawlerProcess.start()
print("Crawler stopped.")
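
The script in the question builds its crawler from a bare Settings() object, which does not read the project's settings.py, so the pipeline is most likely never enabled. The code above starts from the project settings and sets ITEM_PIPELINES explicitly. On newer Scrapy releases (1.0 and later) the same idea can be sketched roughly as below. This is only a sketch, not the answerer's code: run_spider, pipeline_overrides and the myproject module path are hypothetical names, and it assumes the script is run from inside the Scrapy project directory so that get_project_settings() can locate the project.

```python
def pipeline_overrides(pipeline_path, order=300):
    """Build an ITEM_PIPELINES override in the dict form newer Scrapy expects.

    `order` is the pipeline priority; 300 is an arbitrary illustrative value.
    """
    return {"ITEM_PIPELINES": {pipeline_path: order}}


def run_spider(spider_cls, overrides=None, **spider_kwargs):
    """Run one spider to completion using the project's own settings."""
    # Imported lazily so this module can be imported without Scrapy installed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()  # reads the project's settings.py
    for name, value in (overrides or {}).items():
        settings.set(name, value)

    process = CrawlerProcess(settings)
    process.crawl(spider_cls, **spider_kwargs)
    process.start()  # blocks until the crawl has finished


# Hypothetical usage, assuming a project package called "myproject":
#   from myproject.spiders.my_spider import MySpider
#   run_spider(
#       MySpider,
#       overrides=pipeline_overrides("myproject.pipelines.scrapermar11Pipeline"),
#       start_url="abc",
#   )
```

If the project's settings.py already lists the pipeline in ITEM_PIPELINES, the overrides argument can be left out, since the project settings are loaded either way.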

score -1 · Accepted Answer

One solution that worked for me was to give up on the run script and the internal API, and instead use the command line with GNU Parallel to parallelize.

To run all known spiders, one per core:

scrapy list | parallel --line-buffer scrapy crawl

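scrapy list prints all spiders, one per line, which lets us pipe them as arguments to be appended to the command (scrapy crawl) passed to GNU Parallel. --line-buffer means that output received from the processes is printed to stdout interleaved, but line by line, rather than as quarter or half lines garbled together (for other options, see --group and --ungroup).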

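Note: obviously this works best on machines with multiple CPU cores, since by default GNU Parallel runs one job per core. Be aware that, unlike many modern development machines, the cheap AWS EC2 and DigitalOcean tiers have only one virtual CPU core. So if you want to run jobs concurrently on a single core, you will have to use GNU Parallel's --jobs argument. For example, to run 2 scrapy crawlers per core: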

scrapy list | parallel --jobs 200% --line-buffer scrapy crawl
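
If GNU Parallel is not available, roughly the same effect can be had from Python by launching each scrapy crawl as a separate subprocess. The following is a minimal sketch using only the standard library; list_spiders, crawl_command and run_parallel are hypothetical helper names, and it assumes the scrapy command is on PATH and is run from the project directory.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def list_spiders():
    """Return spider names by parsing the output of `scrapy list`."""
    result = subprocess.run(
        ["scrapy", "list"], capture_output=True, text=True, check=True
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def crawl_command(spider_name):
    """Build the argument list for one `scrapy crawl` invocation."""
    return ["scrapy", "crawl", spider_name]


def run_parallel(commands, jobs):
    """Run the commands as subprocesses, at most `jobs` at a time.

    Returns the exit codes in the same order as `commands`.
    """
    def run(command):
        return subprocess.run(command).returncode

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run, commands))


# Hypothetical usage, mirroring `--jobs 200%` (two jobs per core):
#   jobs = 2 * (os.cpu_count() or 1)
#   exit_codes = run_parallel([crawl_command(s) for s in list_spiders()], jobs)
```

Unlike --line-buffer, this sketch does nothing to keep output from different crawlers apart, so lines may be interleaved on the terminal; redirecting each subprocess to its own log file would avoid that.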
