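I am using this CrawlSpider example as the "backbone" for my crawler.

I want to implement this idea:

The first rule follows links. The matched links are then passed on to the second rule, which matches the new links against its pattern and invokes a callback on them.

For example, I have these rules: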

...

start_urls = ['http://play.google.com/store']

rules = (
    Rule(SgmlLinkExtractor(allow=('/store/apps',))),
    Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
)

...

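How I expect the parser to work:

  1. Open 'http://play.google.com/store' and match the first URL, 'https://play.google.com/store/apps/category/SHOPPING/collection/topselling_free'

  2. Pass the URL that was found ('https://play.google.com/store/apps/category/SHOPPING/collection/topselling_free') to the second rule

  3. The second rule tries to match its pattern (allow=('.*/details\?id=',)) and, if it matches, calls the callback 'parse_app' for that URL.

At the moment, the crawler just walks through all the links and does not parse anything.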

1 Answer

As Xu Jiawan hinted, URLs that match /details\?id= also match /store/apps (from what I saw in a brief look).
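The overlap can be checked with plain `re`, without running Scrapy. This is a minimal sketch, assuming each `allow` pattern is tested with a regex search against the absolute URL; the app URL below (`com.example.app`) is a hypothetical example, not one taken from the question.

```python
import re

# The two patterns from the rules in the question.
FOLLOW_PATTERN = re.compile(r'/store/apps')
APP_PATTERN = re.compile(r'/details\?id=')

# Hypothetical app-detail URL, used only for illustration.
url = 'https://play.google.com/store/apps/details?id=com.example.app'

# re.search looks for the pattern anywhere in the string.
matches_follow = bool(FOLLOW_PATTERN.search(url))
matches_app = bool(APP_PATTERN.search(url))

print('matches /store/apps:', matches_follow)
print('matches /details\\?id=:', matches_app)
```

If both checks come out true for a real app URL, then the follow-only rule and the `parse_app` rule are competing for the same link.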

So try changing the order of the rules so that the parse_app rule matches first:

rules = (
    Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
    Rule(SgmlLinkExtractor(allow=('/store/apps',))),
)
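Why the order matters: when several rules match the same link, CrawlSpider uses the first one in the order they are defined. The snippet below is a simplified stand-in for that behaviour, not Scrapy's own code; `first_matching_callback` and the sample URL are hypothetical names introduced for illustration.

```python
import re

def first_matching_callback(url, rules):
    """Return the callback of the first rule whose pattern matches url.

    rules is a list of (pattern, callback) pairs; a callback of None
    stands for a follow-only rule. None is also returned if no rule matches.
    """
    for pattern, callback in rules:
        if re.search(pattern, url):
            return callback
    return None

# Hypothetical app-detail URL, used only for illustration.
url = 'https://play.google.com/store/apps/details?id=com.example.app'

original_order = [(r'/store/apps', None), (r'/details\?id=', 'parse_app')]
swapped_order = [(r'/details\?id=', 'parse_app'), (r'/store/apps', None)]

print(first_matching_callback(url, original_order))
print(first_matching_callback(url, swapped_order))
```

With the original order the follow-only rule claims the app link, so no callback runs; with the swapped order the link reaches `parse_app`.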

Or use deny:

rules = (
    Rule(SgmlLinkExtractor(allow=('/store/apps',), deny=('/details\?id=',))),
    Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
)
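The effect of `deny` can be sketched the same way. This is a simplified model, assuming a link is kept when it matches some `allow` pattern and no `deny` pattern; `link_allowed` and the app URL are hypothetical, and a real link extractor applies further filters (domains, file extensions, and so on) that are left out here.

```python
import re

def link_allowed(url, allow, deny=()):
    """Keep url if it matches any allow pattern and no deny pattern."""
    allowed = any(re.search(p, url) for p in allow)
    denied = any(re.search(p, url) for p in deny)
    return allowed and not denied

category_url = 'https://play.google.com/store/apps/category/SHOPPING'
# Hypothetical app-detail URL, used only for illustration.
app_url = 'https://play.google.com/store/apps/details?id=com.example.app'

follow_allow = (r'/store/apps',)
follow_deny = (r'/details\?id=',)

print(link_allowed(category_url, follow_allow, follow_deny))
print(link_allowed(app_url, follow_allow, follow_deny))
```

Under this model the follow-only rule no longer claims app pages, so they fall through to the rule that calls `parse_app`, whatever the order of the two rules.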

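If you want the first Rule() to be applied to 'http://play.google.com/store' and the second Rule() to then call parse_app, you may need to implement the parse_start_url method so that it generates requests using SgmlLinkExtractor(allow=('/store/apps',)).

Something like: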

from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item

class PlaystoreSpider(CrawlSpider):
    name = 'playstore'
    #allowed_domains = ['example.com']
    start_urls = ['https://play.google.com/store']

    rules = (
        #Rule(SgmlLinkExtractor(allow=('/store/apps',), deny=('/details\?id=',))),
        Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
    )

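    def parse_app(self, response):
        # Called for every response whose URL matched '/details\?id='.
        self.log('Hi, this is an app page! %s' % response.url)
        # do something

    def parse_start_url(self, response):
        # Extract the '/store/apps' links by hand instead of through a Rule.
        # The requests carry no explicit callback, so CrawlSpider's default
        # parse handles them and applies the rules above to each response.
        return [Request(url=link.url)
                for link in SgmlLinkExtractor(
                    allow=('/store/apps',), deny=('/details\?id=',)
                ).extract_links(response)]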

answered 2013-08-02T09:45:12.353