
I am new to Scrapy and I am looking for a way to run it from a Python script. I found two sources that explain this:

http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/

http://snipplr.com/view/67006/using-scrapy-from-a-script/

I can't figure out where to put my spider code and how to call it from the main function. Please help. Here is the example code:

# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script. 
# 
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
# 
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet. 

#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') #Must be at the top before other imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue

class CrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)

# Usage
if __name__ == "__main__":
    log.start()

    """
    This example runs spider1 and then spider2 three times. 
    """
    items = list()
    crawler = CrawlerScript()
    items.append(crawler.crawl('spider1'))
    for i in range(3):
        items.append(crawler.crawl('spider2'))
    print items

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: joehillen
# date  : Oct 24, 2010

Thank you.


8 Answers


All the other answers reference Scrapy v0.x. According to the updated docs, Scrapy 1.0 requires:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
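For concreteness, here is a minimal runnable sketch of what the "spider definition" could look like; the spider name, start URL and parse logic below are illustrative placeholders, not part of the original answer:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'my_spider'                      # hypothetical spider name
    start_urls = ['http://example.com']     # placeholder start URL

    def parse(self, response):
        # emit one item per link found on the page (illustrative only)
        for href in response.css('a::attr(href)').extract():
            yield {'link': href}

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()  # blocks here until the crawl is finished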
Answered 2015-07-13

We can simply use:

from scrapy.crawler import CrawlerProcess
from project.spiders.test_spider import SpiderName

process = CrawlerProcess()
process.crawl(SpiderName, arg1=val1, arg2=val2)
process.start()

Use these arguments inside the spider's __init__ function, storing them on the instance so they are available throughout the spider, as sketched below.
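For illustration, a rough sketch of such a spider; the class name and argument names mirror the snippet above, while the body (storing the values on the instance and building a start URL from them) is an assumption, not the answerer's code:

import scrapy

class SpiderName(scrapy.Spider):
    name = 'spider_name'  # hypothetical name

    def __init__(self, arg1=None, arg2=None, *args, **kwargs):
        super(SpiderName, self).__init__(*args, **kwargs)
        # keep the passed-in values on the instance so they are visible everywhere in the spider
        self.arg1 = arg1
        self.arg2 = arg2
        self.start_urls = ['http://example.com/%s' % arg1]  # placeholder URL built from arg1

    def parse(self, response):
        self.logger.info('crawled %s with arg2=%s', response.url, self.arg2)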

Answered 2019-05-31

Though I haven't tried it, I think the answer can be found within the Scrapy documentation. To quote directly from it:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider

spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here

From what I gather, this is a newer development in the library that renders some of the earlier approaches found online (such as the one in the question) obsolete.

Answered 2013-01-10

In Scrapy 0.19.x you should do this:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings

spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here until the spider_closed signal was sent

Note these lines:

settings = get_project_settings()
crawler = Crawler(settings)

Without them, your spider will not use your settings and will not save the items. It took me a while to figure out why the example in the documentation wasn't saving my items. I sent a pull request to fix the docs example.
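The same point carries over to the newer CrawlerProcess API: if you want the pipelines, feed exports and other options from your project's settings.py to be honoured, pass the project settings in explicitly. A minimal sketch under that assumption (the import path reuses the testspiders example project from the snippet above):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from testspiders.spiders.followall import FollowAllSpider

process = CrawlerProcess(get_project_settings())  # picks up settings.py, including ITEM_PIPELINES
process.crawl(FollowAllSpider, domain='scrapinghub.com')
process.start()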

Another way to do this is to simply call the command directly from your script:

from scrapy import cmdline
cmdline.execute("scrapy crawl followall".split())  #followall is the spider's name

Copied this answer from my first answer here: https://stackoverflow.com/a/19060485/1402286

Answered 2013-09-27

When you need to run multiple crawlers inside one Python script, the reactor shutdown needs to be handled with care, because the reactor can only be stopped once and cannot be restarted.
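If all the crawls can share one process, the usual way around this is to schedule every spider on a single CrawlerProcess and start the reactor only once. A rough sketch under that assumption (Spider1, Spider2 and their import paths are placeholders):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# placeholder imports; point these at your actual spider classes
from myproject.spiders.spider1 import Spider1
from myproject.spiders.spider2 import Spider2

process = CrawlerProcess(get_project_settings())
process.crawl(Spider1)
process.crawl(Spider2)   # both crawls are scheduled before the reactor starts
process.start()          # the reactor runs once and stops when all crawls are done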

However, while working on my project I found that using

os.system("scrapy crawl yourspider")

is the easiest. It saves me from handling all sorts of signals, especially when I have multiple spiders.

If performance is a concern, you can use multiprocessing to run your spiders in parallel, something like:

import os
from multiprocessing import Pool

def _crawl(spider_name=None):
    # each call launches one spider in its own scrapy process
    if spider_name:
        os.system('scrapy crawl %s' % spider_name)
    return None

def run_crawler():
    spider_names = ['spider1', 'spider2', 'spider2']

    pool = Pool(processes=len(spider_names))
    pool.map(_crawl, spider_names)
Answered 2014-12-02

This is a fix for the error that Scrapy throws when running it with CrawlerProcess:

https://github.com/scrapy/scrapy/issues/1904#issuecomment-205331087

First create your usual spider and make sure it runs successfully from the command line. It is very important that it runs and exports data, images or files.

Once that is done, wire up the settings exactly as pasted in my program: the part above the spider class definition and the part below if __name__ == "__main__".

That picks up the necessary settings which "from scrapy.utils.project import get_project_settings" (recommended by many) failed to do.

The parts above and below the spider must be used together; with only one of them, the spider will not run. Also, the spider will only run from the folder that contains scrapy.cfg, not from any other folder.

A tree diagram of the project was attached for reference (image not reproduced here).


#spider.py
import sys
sys.path.append(r'D:\ivana\flow') #folder where scrapy.cfg is located

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from flow import settings as my_settings

#----------------Typical Spider Program starts here-----------------------------

# your spider class definition goes here, e.g. class FlowSpider(scrapy.Spider): ...

#----------------Typical Spider Program ends here-------------------------------

if __name__ == "__main__":

    crawler_settings = Settings()
    crawler_settings.setmodule(my_settings)

    process = CrawlerProcess(settings=crawler_settings)
    process.crawl(FlowSpider) # it is for class FlowSpider(scrapy.Spider):
    process.start(stop_after_crawl=True)
Answered 2020-10-22
# -*- coding: utf-8 -*-
import sys
from scrapy.cmdline import execute


def gen_argv(s):
    sys.argv = s.split()


if __name__ == '__main__':
    gen_argv('scrapy crawl abc_spider')
    execute()

Put this code in a path from which you can run scrapy crawl abc_spider on the command line. (Tested with Scrapy==0.24.6)

Answered 2016-07-07

If you just want to run a simple crawl, it is easiest to just run the command:

scrapy crawl <spider_name>

There are also options to export your results in certain formats, such as JSON, XML or CSV:

scrapy crawl <spider_name> -o result.csv (or result.json, or result.xml)

You may want to try it.

Answered 2017-12-27