python - Scrapy SgmlLinkExtractor 规则和回调让人头疼

Question

我正在尝试做：

class SpiderSpider(CrawlSpider):
    name = "lolies"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/directory/lol2']
    rules = (Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\w+$']), follow=True), Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\w+/\d+$']), follow=True),Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\d+$']), callback=self.parse_loly))

def parse_loly(self, response):
    print 'Hi this is the loly page %s' % response.url
    return

这让我回想起：

NameError: name 'self' is not defined

如果我将回调更改为callback="self.parse_loly"似乎永远不会被调用并打印 URL。

但似乎可以毫无问题地抓取网站，因为我收到了许多针对该规则的 Crawled 200 调试消息。

我可能做错了什么？

提前谢谢各位！

score 1 · Accepted Answer

看起来空格parse_loly没有正确对齐。Python 对空格敏感，因此对于解释器来说，它看起来像是 SpiderSpider 之外的方法。

您可能还希望根据PEP8将规则行拆分为较短的行。

试试这个：

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class SpiderSpider(CrawlSpider):
    name = "lolies"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/directory/lol2/']
    rules = (
        Rule(SgmlLinkExtractor(allow=('\w+$', ))), 
        Rule(SgmlLinkExtractor(allow=('\w+/\d+$', ))),
        Rule(SgmlLinkExtractor(allow=('\d+$',)), callback='parse_loly'),
    )

    def parse_loly(self, response):
        print 'Hi this is the loly page %s' % response.url
        return None

python - Scrapy SgmlLinkExtractor 规则和回调让人头疼

1 回答 1

Related

Reference