
I am trying to run Scrapy from a script, but I cannot get it to create the export file.

I have tried to get the file to export in two different ways:

  1. With a pipeline
  2. With Feed export.

Both of these ways work when I run Scrapy from the command line, but neither works when I run Scrapy from a script.

I am not alone in having this problem. Here are two other, similar unanswered questions, which I did not notice until after I posted this one:

  1. JSON not working in scrapy when calling spider through a python script?
  2. Calling scrapy from a python script not creating JSON output file

Here is my code to run Scrapy from a script. It includes the settings to produce an output file with both the pipeline and the feed exporter.

from twisted.internet import reactor

from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.xlib.pydispatch import dispatcher
import logging

from external_links.spiders.test import MySpider
from scrapy.utils.project import get_project_settings
settings = get_project_settings()

#manually set settings here
settings.set('ITEM_PIPELINES', {'external_links.pipelines.FilterPipeline': 100,
                                'external_links.pipelines.CsvWriterPipeline': 200}, priority='cmdline')
settings.set('DEPTH_LIMIT', 1, priority='cmdline')
settings.set('LOG_FILE', 'Log.log', priority='cmdline')
settings.set('FEED_URI', 'output.csv', priority='cmdline')
settings.set('FEED_FORMAT', 'csv', priority='cmdline')
settings.set('FEED_EXPORTERS', {'csv': 'external_links.exporter.CsvOptionRespectingItemExporter'}, priority='cmdline')
settings.set('FEED_STORE_EMPTY', True, priority='cmdline')

def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider()
crawler = Crawler(settings)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start(loglevel=logging.DEBUG)
log.msg('reactor running...')
reactor.run()
log.msg('Reactor stopped...')

After I run this code, the log says: "Stored csv feed (341 items) in: output.csv", but there is no output.csv to be found.
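
Since the log explicitly claims the feed was stored, my first guess (an assumption, not a confirmed diagnosis) is that output.csv is being written relative to a different working directory than the one I am looking in. A quick check, runnable alongside the script above:

import os

# Show where a relative path like 'output.csv' actually resolves to.
print('working directory:', os.getcwd())
print('expected file:', os.path.abspath('output.csv'))
print('file exists:', os.path.exists('output.csv'))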

Here is my feed exporter code:

from scrapy.utils.project import get_project_settings

settings = get_project_settings()

#manually set settings here
settings.set('ITEM_PIPELINES', {'external_links.pipelines.FilterPipeline': 100,
                                'external_links.pipelines.CsvWriterPipeline': 200}, priority='cmdline')
settings.set('DEPTH_LIMIT', 1, priority='cmdline')
settings.set('LOG_FILE', 'Log.log', priority='cmdline')
settings.set('FEED_URI', 'output.csv', priority='cmdline')
settings.set('FEED_FORMAT', 'csv', priority='cmdline')
settings.set('FEED_EXPORTERS', {'csv': 'external_links.exporter.CsvOptionRespectingItemExporter'}, priority='cmdline')
settings.set('FEED_STORE_EMPTY', True, priority='cmdline')


from scrapy.contrib.exporter import CsvItemExporter


class CsvOptionRespectingItemExporter(CsvItemExporter):

    def __init__(self, *args, **kwargs):
        delimiter = settings.get('CSV_DELIMITER', ',')
        kwargs['delimiter'] = delimiter
        super(CsvOptionRespectingItemExporter, self).__init__(*args, **kwargs)
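
For reference, this exporter only changes anything when CSV_DELIMITER is set; CSV_DELIMITER is a custom setting name I chose myself, not a built-in Scrapy option. Example:

settings.set('CSV_DELIMITER', '\t', priority='cmdline')  # custom setting read by my exporter above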

Here is my pipeline code:

import csv

class CsvWriterPipeline(object):

    def __init__(self):
        self.csvwriter = csv.writer(open('items2.csv', 'wb'))

    def process_item(self, item, spider):  # item needs to be second in this list, otherwise we get the spider object
        self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])
        return item
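
One thing I notice in this pipeline (an observation, not a confirmed fix) is that the file handle is never flushed or closed, so buffered rows could be lost if the reactor stops abruptly. A minimal sketch using Scrapy's standard open_spider/close_spider pipeline hooks:

import csv

class CsvWriterPipeline(object):

    def open_spider(self, spider):
        # Open the file when the spider starts, keeping the handle so we can close it.
        self.file = open('items2.csv', 'wb')
        self.csvwriter = csv.writer(self.file)

    def process_item(self, item, spider):
        self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])
        return item

    def close_spider(self, spider):
        # Flush and close so buffered rows reach disk even if the reactor
        # is stopped immediately after the crawl finishes.
        self.file.close()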

1 Answer


I had the same problem.

Here is what worked for me:

  1. Put the export URI in settings.py:

    FEED_URI='file:///tmp/feeds/filename.jsonlines'

  2. Create a script scrape.py next to your scrapy.cfg with the following content:

     
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    
    
    process = CrawlerProcess(get_project_settings())
    
    process.crawl('yourspidername') #'yourspidername' is the name of one of the spiders of the project.
    process.start() # the script will block here until the crawling is finished
    
    
  3. Run: python scrape.py

Result: the file gets created.
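
If you prefer not to edit settings.py, a variation that should also work (an untested sketch on my side; 'yourspidername' remains a placeholder) is to override the feed settings in the script itself before handing them to CrawlerProcess:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    # An absolute file:// URI avoids any ambiguity about the working directory.
    settings.set('FEED_URI', 'file:///tmp/feeds/output.csv', priority='cmdline')
    settings.set('FEED_FORMAT', 'csv', priority='cmdline')

    process = CrawlerProcess(settings)
    process.crawl('yourspidername')  # placeholder: the name of one of your project's spiders
    process.start()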

Note: I have no pipelines in my project, so I am not sure whether a pipeline is filtering your results.

Also: here is the common pitfalls section of the docs, which is what helped me.

answered 2016-06-14T08:57:31.143