TL;DR: see the self-contained, minimal example script for running Scrapy below.
First of all, having a regular Scrapy project with a separate .cfg, settings.py, pipelines.py, items.py, a spiders package and so on is the recommended way to keep and handle your web-scraping logic. It provides modularity and separation of concerns, and keeps things organized, clear and testable.
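For reference, the layout produced by scrapy startproject (assuming, hypothetically, a project named myproject with a single spider module) looks roughly like this:

scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        myspider.py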
If you have created a project by following the official Scrapy tutorial, you run the web scraping via the special scrapy command-line tool:
scrapy crawl myspider
However, Scrapy also provides an API for running the crawl from a script.
There are a couple of key concepts worth mentioning first:
The reactor is the core of the event loop within Twisted – the loop which drives applications using Twisted. An event loop is a programming construct that waits for and dispatches events or messages in a program. It works by calling some internal or external "event provider", which generally blocks until an event has arrived, and then calls the relevant event handler ("dispatches the event"). The reactor provides basic interfaces to a number of services, including network communications, threading, and event dispatching.
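To make the event-loop idea concrete, here is a minimal sketch (not part of the original example) that schedules a single callback on the Twisted reactor and stops the loop from inside it:

from twisted.internet import reactor

def say_hello():
    print("hello from inside the event loop")
    reactor.stop()  # the loop has to be stopped explicitly

reactor.callLater(1, say_hello)  # schedule an event one second from now
reactor.run()                    # blocks, waiting for and dispatching events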
Here is a basic, simplified picture of what running Scrapy from a script looks like:
Create a Settings instance (or use get_project_settings() to reuse existing project settings):
settings = Settings() # or settings = get_project_settings()
Instantiate a Crawler with the settings instance passed in:
crawler = Crawler(settings)
Instantiate a spider (this is what it is eventually all about, right?):
spider = MySpider()
Configure signals. This is an important step if you want to have post-processing logic, collect stats, or, at the very least, to ever finish crawling, since the Twisted reactor needs to be stopped manually. The Scrapy docs suggest stopping the reactor in the spider_closed signal handler:

Note that you will also have to shutdown the Twisted reactor yourself after the spider is finished. This can be achieved by connecting a handler to the signals.spider_closed signal.
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()
    # stats here is a dictionary of crawling stats that you usually see on the console
    # here we need to stop the reactor
    reactor.stop()

crawler.signals.connect(callback, signal=signals.spider_closed)
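Other signals can be connected in the same way. For example, a small sketch (not in the original snippet) that hooks signals.item_scraped on the same crawler object to keep a custom item counter alongside the built-in stats:

scraped_count = [0]

def item_scraped(item, response, spider):
    # fired once per item that has passed through all pipelines
    scraped_count[0] += 1

crawler.signals.connect(item_scraped, signal=signals.item_scraped)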
And here is an example self-contained script that uses the DmozSpider spider and involves an item loader with input and output processors, plus an item pipeline:
import json

from scrapy.crawler import Crawler
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose, TakeFirst
from scrapy import log, signals, Spider, Item, Field
from scrapy.settings import Settings
from twisted.internet import reactor


# define an item class
class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()


# define an item loader with input and output processors
class DmozItemLoader(ItemLoader):
    default_input_processor = MapCompose(unicode.strip)
    default_output_processor = TakeFirst()

    desc_out = Join()


# define a pipeline
class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


# define a spider
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            loader = DmozItemLoader(DmozItem(), selector=sel, response=response)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()


# callback fired when the spider is closed
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # collect/log stats?

    # stop the reactor
    reactor.stop()


# instantiate settings and provide a custom configuration
settings = Settings()
settings.set('ITEM_PIPELINES', {
    '__main__.JsonWriterPipeline': 100
})

# instantiate a crawler passing in settings
crawler = Crawler(settings)

# instantiate a spider
spider = DmozSpider()

# configure signals
crawler.signals.connect(callback, signal=signals.spider_closed)

# configure and start the crawler
crawler.configure()
crawler.crawl(spider)
crawler.start()

# start logging
log.start()

# start the reactor (blocks execution)
reactor.run()
Run it the usual way:
python runner.py
and observe the items exported to items.jl with the help of the pipeline:
{"desc": "", "link": "/", "title": "Top"}
{"link": "/Computers/", "title": "Computers"}
{"link": "/Computers/Programming/", "title": "Programming"}
{"link": "/Computers/Programming/Languages/", "title": "Languages"}
{"link": "/Computers/Programming/Languages/Python/", "title": "Python"}
...
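Since the pipeline writes one JSON object per line (the JSON Lines format), the file can be read back with a few lines of plain Python. A small sketch, not part of the original script:

import json

with open('items.jl') as f:
    items = [json.loads(line) for line in f]

print(items[0]['title'])  # e.g. "Top", per the output above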
The gist is available here (feel free to improve it):
Notes:
If you define settings by just instantiating a Settings() object, you get all of the default Scrapy settings. But if you want, for example, to configure an existing pipeline, set a DEPTH_LIMIT, or tweak any other setting, you need to set it in the script via settings.set() (as demonstrated in the example):
pipelines = {
    'mypackage.pipelines.FilterPipeline': 100,
    'mypackage.pipelines.MySQLPipeline': 200
}
settings.set('ITEM_PIPELINES', pipelines, priority='cmdline')
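Any other built-in setting can be overridden the same way. For instance, a short sketch (the values are arbitrary and not from the original example):

settings.set('DEPTH_LIMIT', 2)          # limit how deep the crawl may go
settings.set('COOKIES_ENABLED', False)  # disable the cookies middleware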
Or, alternatively, use an existing settings.py with all of the custom settings preconfigured:
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
Other useful links on the subject: