As 许家万 hinted, URLs matching /details\?id= also match /store/apps (from what I briefly saw), so try changing the order of the rules so that the parse_app rule matches first:
rules = (
    Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
    Rule(SgmlLinkExtractor(allow=('/store/apps',))),
)
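To see why the order matters, note that a typical app detail URL matches both allow patterns, so whichever rule is listed first claims the link. A quick standalone check with Python's re module (independent of Scrapy; the example URL is made up) confirms the overlap:

```python
import re

# A made-up Google Play app detail URL for illustration
url = 'https://play.google.com/store/apps/details?id=com.example.app'

# Both allow patterns from the rules above match this URL,
# so the first Rule in the tuple will handle the link.
print(bool(re.search(r'/details\?id=', url)))  # True
print(bool(re.search(r'/store/apps', url)))    # True
```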
Or use deny:
rules = (
    Rule(SgmlLinkExtractor(allow=('/store/apps',), deny=('/details\?id=',))),
    Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
)
If you want the first Rule() to apply only to "http://play.google.com/store", and then have the second Rule() call parse_app, you may need to implement a parse_start_url method that generates requests using SgmlLinkExtractor(allow=('/store/apps',)).

Something like:
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item


class PlaystoreSpider(CrawlSpider):
    name = 'playstore'
    #allowed_domains = ['example.com']
    start_urls = ['https://play.google.com/store']

    rules = (
        #Rule(SgmlLinkExtractor(allow=('/store/apps',), deny=('/details\?id=',))),
        Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
    )

    def parse_app(self, response):
        self.log('Hi, this is an app page! %s' % response.url)
        # do something

    def parse_start_url(self, response):
        return [Request(url=link.url)
                for link in SgmlLinkExtractor(
                    allow=('/store/apps',), deny=('/details\?id=',)
                ).extract_links(response)]