python - Running Scrapy from a Python script

Question

I have been trying to run Scrapy from a Python script file, because I need to fetch the data and save it to my database. When I run it with the scrapy command

scrapy crawl argos

the spider runs fine, but when I try to run it from a script, following this link

http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script

I get this error

$ python pricewatch/pricewatch.py update
Traceback (most recent call last):
  File "pricewatch/pricewatch.py", line 39, in <module>
    main()
  File "pricewatch/pricewatch.py", line 31, in main
    update()
  File "pricewatch/pricewatch.py", line 24, in update
    setup_crawler("argos.co.uk")
  File "pricewatch/pricewatch.py", line 13, in setup_crawler
    settings = get_project_settings()
  File "/Library/Python/2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/utils/project.py", line 58, in get_project_settings
    settings_module = import_module(settings_module_path)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named settings

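I can't understand why get_project_settings() fails to find the settings module here, when running the spider with the scrapy command in the terminal works fine.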

Here is a screenshot of my project (image not preserved).

This is the pricewatch.py code:

import commands
import sys
from database import DBInstance
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log
from spiders.argosspider import ArgosSpider
from scrapy.utils.project import get_project_settings
import settings

def setup_crawler(domain):
    spider = ArgosSpider(domain=domain)
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

def update():
    #print "Enter a product to update:"
    #product = raw_input()
    #print product
    #db = DBInstance()
    setup_crawler("argos.co.uk")
    log.start()
    reactor.run()

def main():
    try:
        if sys.argv[1] == "update":
            update()
        elif sys.argv[1] == "database":
            #db = DBInstance()
            pass  # placeholder so the otherwise empty branch is valid syntax
    except IndexError:
        print "You must select a command from Update, Search, History"


if __name__ == '__main__':
    main()

score 2 · Accepted Answer

I have fixed it: I just needed to move pricewatch.py to the project's top-level directory and run it from there.
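
A possible explanation, offered as an assumption rather than something confirmed in the question: Scrapy locates the project by looking for scrapy.cfg in the current directory and its parents, then imports the settings module named there. That import depends on where the script lives, because Python puts the script's own directory first on sys.path. The sketch below is not Scrapy's code. It is a simplified standard-library imitation of that lookup, written for Python 3, and the helper names find_scrapy_cfg and settings_module_from_cfg are hypothetical. Running it from the directory you launch your script in shows which settings module would be requested.

```python
import configparser
from pathlib import Path


def find_scrapy_cfg(start="."):
    """Walk upward from `start` and return the first scrapy.cfg found, else None."""
    current = Path(start).resolve()
    for directory in [current, *current.parents]:
        candidate = directory / "scrapy.cfg"
        if candidate.is_file():
            return candidate
    return None


def settings_module_from_cfg(cfg_path):
    """Return the `default` entry of the [settings] section, or None if absent."""
    parser = configparser.ConfigParser()
    parser.read(str(cfg_path))
    return parser.get("settings", "default", fallback=None)


if __name__ == "__main__":
    cfg = find_scrapy_cfg()
    if cfg is None:
        print("no scrapy.cfg found from the current directory upward")
    else:
        print(cfg, "->", settings_module_from_cfg(cfg))
```

If the printed module path (for example a dotted name ending in .settings) cannot be imported from the script's directory, moving the script next to scrapy.cfg, as described above, is one way to make the import resolve.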

score 0 · Accepted Answer

This answer is largely copied from another answer. I believe it answers your question, and it additionally provides a decent example.

Consider a project with the following structure.

my_project/
    main.py                 # Where we are running scrapy from
    scraper/
        run_scraper.py               #Call from main goes here
        scrapy.cfg                   # deploy configuration file
        scraper/                     # project's Python module, you'll import your code from here
            __init__.py
            items.py                 # project items definition file
            pipelines.py             # project pipelines file
            settings.py              # project settings file
            spiders/                 # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py     # Contains the QuotesSpider class

Basically, the command scrapy startproject scraper was executed in the my_project folder. I then added a run_scraper.py file to the outer scraper folder, a main.py file to my root folder, and quotes_spider.py to the spiders folder.

My main file:

from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()

My run_scraper.py file:

from scraper.scraper.spiders.quotes_spider import QuotesSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os


class Scraper:
    def __init__(self):
        settings_file_path = 'scraper.scraper.settings' # The path seen from root, ie. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider # The spider you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished
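
One detail of the class above worth keeping in mind: os.environ.setdefault only writes SCRAPY_SETTINGS_MODULE when the variable is not already set, so a value exported in the shell wins over the path hard-coded in the script. A minimal sketch of that behaviour follows. The helper name resolve_settings_module is hypothetical, and it takes a plain dict so it can be tried without touching the real environment.

```python
import os


def resolve_settings_module(default_path, environ=None):
    """Return the settings module that would be used after setdefault()."""
    # Fall back to the real process environment when no mapping is passed in.
    environ = os.environ if environ is None else environ
    # setdefault() keeps an existing value and only fills in a missing one.
    environ.setdefault("SCRAPY_SETTINGS_MODULE", default_path)
    return environ["SCRAPY_SETTINGS_MODULE"]
```

If the wrong settings module keeps being loaded, checking whether the variable is already set in your shell is a cheap first step.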

Also, note that the settings may need a look-over, since the module paths have to be relative to the root folder (my_project, not scraper). So in my case:

SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'

ETC...
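
To check that the dotted paths really resolve from the root folder, something like the following sketch may help. It uses only the standard library; the helper name unimportable is hypothetical, and the paths in the usage note are the ones from this answer.

```python
import importlib.util


def unimportable(module_paths):
    """Return the dotted module paths that cannot be located via sys.path."""
    missing = []
    for dotted in module_paths:
        try:
            # For dotted names, find_spec imports the parent packages first.
            spec = importlib.util.find_spec(dotted)
        except ImportError:
            spec = None  # a parent package could not be imported
        if spec is None:
            missing.append(dotted)
    return missing
```

Run from my_project, calling unimportable(['scraper.scraper.settings', 'scraper.scraper.spiders']) should return an empty list if the layout above is in place. Any path it returns is one that Python cannot see from that folder.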
