python - Scrapy cannot find the spider when trying to use arguments

Question

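I have successfully created a spider that retrieves the links from every page of a domain.

I would like to do the same thing for several domains that I host. For that, I would prefer to reuse my spider and simply pass the domain to monitor as an argument.

The documentation here explains that we should explicitly define the constructor and add the argument to it, then launch the spider with the command scrapy crawl myspider.

Here is my code: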

class MySpider(BaseSpider):
    name= 'spider'

    def __init__(self, domain='some_domain.net'):
        self.domain = domain
        self.allowed_domains = [self.domain]
        self.start_urls = [ 'http://'+self.domain ]

    def parse(self, response):
        hxs = HtmlPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not url.startswith('http://'):
                url= URL + url 
            print url
        yield Request(url, callback=self.parse)

However, launching

scrapy crawl spider -a domain='mydomain.my_extension'

returns:

ERROR: unable to find spider: spider

When I launch the same code without the explicit constructor, I cannot do that with crawl either; I have to use this command:

scrapy runspider /path/to/spider/spider.py

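And I cannot pass arguments with runspider; I have to run crawl for that.

Why is it not possible to use scrapy crawl spider? Why is the spider's name never read by scrapy crawl, the way it is by scrapy runspider?

Scrapy looks great, but at second glance it seems rather unsettling :/

Thank you very much for your help.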

score 0 · Accepted Answer

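If you are running Scrapy 0.14, you should set the variables at the class level rather than at the instance level. I think this changed in 0.15.

I suggest reading the documentation: http://doc.scrapy.org/en/0.14/topics/spiders.html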
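To make the class-level versus instance-level distinction concrete, here is a minimal sketch in plain Python. It is not Scrapy code: `BaseSpiderStandIn`, `ClassLevelSpider` and `InstanceLevelSpider` are hypothetical names made up for illustration, and the stand-in base class only mimics the general idea of a framework base class.

```python
class BaseSpiderStandIn(object):
    """Hypothetical stand-in for a framework base class (not Scrapy's)."""

    name = None

    def __init__(self, **kwargs):
        # Copy keyword arguments onto the instance, as a framework might do
        # with options passed on the command line.
        self.__dict__.update(kwargs)


class ClassLevelSpider(BaseSpiderStandIn):
    # Class body: earlier names are referenced directly, there is no `self`.
    name = 'spider'
    domain = 'some_domain.net'
    allowed_domains = [domain]
    start_urls = ['http://' + domain]


class InstanceLevelSpider(BaseSpiderStandIn):
    name = 'spider'

    def __init__(self, domain='some_domain.net', **kwargs):
        # Call the base constructor so its own setup still runs.
        super(InstanceLevelSpider, self).__init__(**kwargs)
        self.domain = domain
        self.allowed_domains = [domain]
        self.start_urls = ['http://' + domain]


# Class-level values are readable without creating an instance.
print(ClassLevelSpider.start_urls)
# Instance-level values exist only after the constructor has run.
print(hasattr(InstanceLevelSpider, 'start_urls'))
print(InstanceLevelSpider(domain='mydomain.example').start_urls)
```

Note that inside a class body there is no `self`; earlier attributes such as `domain` are simply referred to by name.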

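from urlparse import urljoin

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class MySpider(BaseSpider):
    # Class-level attributes: there is no `self` here, so use plain names.
    name = 'spider'
    domain = 'some_domain.net'  # default domain taken from the question
    allowed_domains = [domain]
    start_urls = ['http://' + domain]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            # Resolve relative links against the URL of the current page.
            url = urljoin(response.url, url)
            print url
            # Yield inside the loop so that every link is followed.
            yield Request(url, callback=self.parse)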
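The parse method above needs absolute URLs before it can request them. As a rough sketch of one way to get them, the standard library's `urljoin` resolves an href against the URL of the page it came from. The snippet uses Python 3's `urllib.parse` (on the Python 2 used with Scrapy 0.14 the same function lives in the `urlparse` module); the page URL and hrefs are made-up examples, and `absolute_links` is a hypothetical helper name.

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs):
    """Return each href resolved against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical page URL and hrefs, for illustration only.
page = 'http://some_domain.net/a/b.html'
links = absolute_links(page, ['c.html', '/c.html', 'http://other.net/x'])
for link in links:
    print(link)
```

Relative hrefs are resolved against the page, while hrefs that are already absolute come back unchanged, so no `startswith('http://')` check is needed.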
