I'm using Scrapy 1.0.5 and Gearman to create distributed spiders. The idea is to build a spider, call it from a Gearman worker script, and pass it 20 URLs at a time, from a Gearman client to the worker and then on to the spider.

I'm able to start the worker and pass URLs from the client to the spider for crawling. The first URL or array of URLs does get picked up and scraped. Once the spider finishes, I can't reuse it: I get the log message that the spider has closed. When I run the client again, the spider reopens but doesn't crawl anything.

Here is my worker:

import gearman
import json
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


gm_worker = gearman.GearmanWorker(['localhost:4730'])

def task_listener_reverse(gearman_worker, gearman_job):
    process = CrawlerProcess(get_project_settings())

    data = json.loads(gearman_job.data)
    if(data['vendor_name'] == 'walmart'):
        process.crawl('walmart', url=data['url_list'])
        process.start() # the script will block here until the crawling is finished
        return 'completed'

# gm_worker.set_client_id is optional
gm_worker.set_client_id('python-worker')
gm_worker.register_task('reverse', task_listener_reverse)

# Enter our work loop and call gm_worker.after_poll() after each time we timeout/see socket activity
gm_worker.work()
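
For reference, the client side that dispatches the jobs looks roughly like this (simplified sketch; the 'reverse' task name and the vendor_name/url_list payload match the worker above, and the URLs are just placeholders):

import json

import gearman

gm_client = gearman.GearmanClient(['localhost:4730'])

payload = {
    'vendor_name': 'walmart',
    'url_list': [
        'http://www.walmart.com/ip/example-product-1',
        'http://www.walmart.com/ip/example-product-2',
    ],
}

# Submit the job to the 'reverse' task registered by the worker and block
# until the worker returns its result string.
completed_request = gm_client.submit_job('reverse', json.dumps(payload))
print(completed_request.result)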

Here is the code for my spider:

from crawler.items import CrawlerItemLoader
from scrapy.spiders import Spider


class WalmartSpider(Spider):
    name = "walmart"

    def __init__(self, **kw):
        super(WalmartSpider, self).__init__(**kw)
        self.start_urls = kw.get('url')
        self.allowed_domains = ["walmart.com"]

    def parse(self, response):

        item = CrawlerItemLoader(response=response)

        item.add_value('url', response.url)


        # Title
        item.add_xpath('title', '//div/h1/span/text()')

        if response.xpath('//div/h1/span/text()'):
            title = response.xpath('//div/h1/span/text()')
            item.add_value('title', title)

        yield item.load_item()
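
CrawlerItemLoader comes from my items module, which I haven't included here; a minimal loader of that kind would look something like the sketch below (illustrative only, not my actual items module — url and title are the only fields the spider above uses):

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class CrawlerItem(scrapy.Item):
    # Illustrative stand-in for the project's real item definition.
    url = scrapy.Field()
    title = scrapy.Field()


class CrawlerItemLoader(ItemLoader):
    # Load into CrawlerItem by default and keep only the first
    # extracted value for each field.
    default_item_class = CrawlerItem
    default_output_processor = TakeFirst()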

The first client run produces results; whether it's a single URL or multiple URLs, I get the data I need.

On the second run, the spider opens but yields no results. This is what I get back, and then it just stops:

2016-02-19 01:16:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-19 01:16:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-19 01:16:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-19 01:16:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-19 01:16:30 [scrapy] INFO: Enabled item pipelines: MySQLStorePipeline
2016-02-19 01:16:30 [scrapy] INFO: Enabled item pipelines: MySQLStorePipeline
2016-02-19 01:16:30 [scrapy] INFO: Spider opened
2016-02-19 01:16:30 [scrapy] INFO: Spider opened
2016-02-19 01:16:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-19 01:16:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-19 01:16:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6047
2016-02-19 01:16:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6047

I was able to print the URL or URLs from both the worker and the spider, and I made sure they were being passed on the first, working run as well as on the second, non-working run. I've spent two days on this without getting anywhere; I'd appreciate any pointers.

1 Answer

Well, I decided to abandon Scrapy. I looked around and everyone kept pointing to the limitations of the Twisted reactor. Rather than fight the framework, I decided to build my own crawler, and it has served my needs very well. I'm able to start multiple Gearman workers and scrape data simultaneously across a server farm with the scraper I built.

If anyone is interested, I started from this simple article to build the scraper. I use a Gearman client to query the database and send multiple URLs to a worker; the worker scrapes the URLs and returns an update query to the database. Success!! :) A rough sketch of what such a worker ends up looking like is below the link.

http://docs.python-guide.org/en/latest/scenarios/scrape/
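
This is a rough outline rather than my exact code: the 'scrape' task name and the scrape_url helper are illustrative, the XPath is borrowed from the Walmart spider above, and the per-URL database update is left as a comment. It uses requests and lxml as in the linked article.

import json

import gearman
import requests
from lxml import html

gm_worker = gearman.GearmanWorker(['localhost:4730'])

def scrape_url(url):
    # Fetch the page and pull out the product title with the same XPath
    # the Scrapy spider used; returns None when nothing matches.
    page = requests.get(url, timeout=30)
    tree = html.fromstring(page.content)
    titles = tree.xpath('//div/h1/span/text()')
    return titles[0].strip() if titles else None

def task_listener_scrape(gearman_worker, gearman_job):
    data = json.loads(gearman_job.data)
    results = {}
    for url in data['url_list']:
        results[url] = scrape_url(url)
        # This is where the update query goes back to the database
        # for each scraped URL (omitted here).
    return json.dumps(results)

gm_worker.set_client_id('python-scrape-worker')
gm_worker.register_task('scrape', task_listener_scrape)
gm_worker.work()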

Answered 2016-03-08T03:20:31.603