
Hi, I have Python Scrapy installed on my Mac and I was trying to follow the very first example on their website.

They were trying to run the command:

scrapy crawl mininova.org -o scraped_data.json -t json

I don't quite understand what this means. It looks like scrapy turns out to be a separate program, and I don't think it has a command called crawl. In the example, they have a paragraph of code that defines the MininovaSpider and TorrentItem classes. I don't know where these two classes should go: do they belong in the same file, and what should that Python file be named?


2 Answers


TL;DR: see the self-contained minimal example script for running Scrapy below.

First of all, having a regular Scrapy project with a separate .cfg, settings.py, pipelines.py, items.py, a spiders package, etc. is the recommended way to keep and handle your web-scraping logic. It provides modularity and separation of concerns, and keeps things organized, clear, and testable.

If you created the project by following the official Scrapy tutorial, you run the web scraping via the special scrapy command-line tool:

scrapy crawl myspider

However, Scrapy also provides an API for running a crawl from a script.

There are several key concepts worth mentioning:

  • Settings - basically a key-value "container" which is initialized with default built-in values
  • Crawler class - the main class that acts as the glue for all the different components involved in web scraping with Scrapy
  • Twisted reactor - since Scrapy is built on top of the twisted asynchronous networking library, starting a crawler means putting it inside the Twisted reactor, which, in simple terms, is an event loop:

The reactor is the core of the event loop within Twisted - the loop which drives applications using Twisted. The event loop is a programming construct that waits for and dispatches events or messages in a program. It works by calling some internal or external "event provider", which generally blocks until an event has arrived, and then calls the relevant event handler ("dispatches the event"). The reactor provides basic interfaces to a number of services, including network communications, threading, and event dispatching.
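
To make the "event loop" idea concrete, here is a tiny standalone Twisted sketch (not Scrapy-specific): a callback is scheduled, the reactor is started, and the loop runs until reactor.stop() is called.

    from twisted.internet import reactor

    def say_hello():
        print("hello from inside the Twisted event loop")
        reactor.stop()  # stop the loop; otherwise reactor.run() blocks forever

    # schedule a callback to fire one second after the reactor starts
    reactor.callLater(1, say_hello)

    # start the event loop; this call blocks until reactor.stop() is called
    reactor.run()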

Here is the basic, simplified process of running Scrapy from a script:

  • Create a Settings instance (or use get_project_settings() to reuse the existing project settings):

    settings = Settings()  # or settings = get_project_settings()
    
  • Instantiate a Crawler with the settings instance passed in:

    crawler = Crawler(settings)
    
  • Instantiate a spider (this is what it is all about eventually, right?):

    spider = MySpider()
    
  • Configure signals. This is an important step if you want to have post-processing logic, collect stats, or, at the very least, to ever finish crawling, since the Twisted reactor needs to be stopped manually. The Scrapy docs suggest stopping the reactor in the spider_closed signal handler:

Note that you will also have to shutdown the Twisted reactor yourself after the spider is finished. This can be achieved by connecting a handler to the signals.spider_closed signal.

def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()
    # stats here is a dictionary of crawling stats that you usually see on the console        

    # here we need to stop the reactor
    reactor.stop()

crawler.signals.connect(callback, signal=signals.spider_closed)
  • Configure and start the crawler instance with the spider passed in:

    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    
  • Optionally, start logging:

    log.start()
    
  • Start the reactor - this blocks the script execution:

    reactor.run()
    

Here is an example self-contained script that uses a DmozSpider spider and involves an item loader with input and output processors, plus an item pipeline:

import json

from scrapy.crawler import Crawler
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose, TakeFirst
from scrapy import log, signals, Spider, Item, Field
from scrapy.settings import Settings
from twisted.internet import reactor


# define an item class
class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()


# define an item loader with input and output processors
class DmozItemLoader(ItemLoader):
    default_input_processor = MapCompose(unicode.strip)
    default_output_processor = TakeFirst()

    desc_out = Join()


# define a pipeline
class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


# define a spider
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            loader = DmozItemLoader(DmozItem(), selector=sel, response=response)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()


# callback fired when the spider is closed
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # collect/log stats?

    # stop the reactor
    reactor.stop()


# instantiate settings and provide a custom configuration
settings = Settings()
settings.set('ITEM_PIPELINES', {
    '__main__.JsonWriterPipeline': 100
})

# instantiate a crawler passing in settings
crawler = Crawler(settings)

# instantiate a spider
spider = DmozSpider()

# configure signals
crawler.signals.connect(callback, signal=signals.spider_closed)

# configure and start the crawler
crawler.configure()
crawler.crawl(spider)
crawler.start()

# start logging
log.start()

# start the reactor (blocks execution)
reactor.run()

Run it the usual way:

python runner.py

and watch the items exported to items.jl with the help of the pipeline:

{"desc": "", "link": "/", "title": "Top"}
{"link": "/Computers/", "title": "Computers"}
{"link": "/Computers/Programming/", "title": "Programming"}
{"link": "/Computers/Programming/Languages/", "title": "Languages"}
{"link": "/Computers/Programming/Languages/Python/", "title": "Python"}
...

The gist is available here (feel free to improve it):


Notes:

If you define settings by instantiating a Settings() object, you get all the default Scrapy settings. If, however, you want to, for example, configure an existing pipeline, set DEPTH_LIMIT, or tweak any other setting, you need to set it in the script via settings.set() (as demonstrated in the example):

pipelines = {
    'mypackage.pipelines.FilterPipeline': 100,
    'mypackage.pipelines.MySQLPipeline': 200
}
settings.set('ITEM_PIPELINES', pipelines, priority='cmdline')

Or, use an existing settings.py with all the custom settings pre-configured:

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
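
The two approaches can also be combined: load the project settings and then override individual values from the script. A small sketch, assuming DEPTH_LIMIT is the setting you want to tweak:

    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    # override a single setting on top of the values from settings.py
    settings.set('DEPTH_LIMIT', 2, priority='cmdline')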

Other useful links on the subject:

Answered 2015-01-02T15:53:18.493

You may have better luck going through the tutorial first, as opposed to the "Scrapy at a glance" webpage.

The tutorial implies that Scrapy is, in fact, a separate program.

Running the command scrapy startproject tutorial will create a folder named tutorial with several files already set up for you.

For example, in my case, the modules/packages items, pipelines, settings and spiders were added to the root package tutorial.

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

The TorrentItem class would be placed inside items.py, and the MininovaSpider class would go into the spiders folder.
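
As a rough sketch of that split (the exact field names, URLs and rules depend on the version of the "at a glance" example you are following, so treat this as an illustration rather than the canonical tutorial code):

    # tutorial/items.py
    from scrapy.item import Item, Field

    class TorrentItem(Item):
        url = Field()
        name = Field()
        description = Field()

    # tutorial/spiders/mininova_spider.py (the file name is up to you)
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    from tutorial.items import TorrentItem

    class MininovaSpider(CrawlSpider):
        name = 'mininova.org'
        allowed_domains = ['mininova.org']
        start_urls = ['http://www.mininova.org/today']
        rules = [Rule(SgmlLinkExtractor(allow=[r'/tor/\d+']), 'parse_torrent')]

        def parse_torrent(self, response):
            torrent = TorrentItem()
            torrent['url'] = response.url
            torrent['name'] = response.xpath('//h1/text()').extract()
            return torrent

The crawl command then finds the spider by its name attribute (mininova.org), which is why scrapy crawl mininova.org from the question works once the classes live inside the project.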

Once the project is set up, the command-line parameters for Scrapy appear fairly straightforward. They take the form:

scrapy crawl <website-name> -o <output-file> -t <output-type>

Alternatively, if you want to run scrapy without the overhead of creating a project directory, you can use the runspider command:

scrapy runspider my_spider.py
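
A minimal sketch of such a self-contained file (the file name, spider name, URLs and fields here are placeholders, not part of the tutorial):

    # my_spider.py
    from scrapy import Spider, Item, Field

    class PageItem(Item):
        title = Field()
        link = Field()

    class MySpider(Spider):
        name = "my_spider"
        start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

        def parse(self, response):
            # yield one item per list entry on the page
            for sel in response.xpath('//ul/li'):
                item = PageItem()
                item['title'] = sel.xpath('a/text()').extract()
                item['link'] = sel.xpath('a/@href').extract()
                yield item

You can combine it with the same output options as crawl, e.g. scrapy runspider my_spider.py -o items.json.
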
Answered 2013-09-16T22:49:58.843