from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from aibang.items import OrgItem

class OrgSpider(CrawlSpider):
  name = "org"
  allowed_domains = ["demo-site.com"]
  start_urls = [
      'http://demo-site.com/detail/17507640-419823665'
  ]

  rules = (
      # Item list pages: follow their links (no callback, so follow defaults to True)
      Rule(SgmlLinkExtractor(allow=(r'list/\d+$',))),
      # Item detail pages: parse them, but do not follow the links found inside
      Rule(SgmlLinkExtractor(allow=(r'detail/\d+-\d+$',)), callback='parse_item', follow=False),
  )
  
  def parse_item(self, response):
    hxs = HtmlXPathSelector(response)

    item = OrgItem()
    try:
      item['name'] = hxs.select('//div[@class="b_title"]/h1/text()')[0].extract()
    except IndexError:
      # The title element is missing, so skip this page
      print 'Something went wrong, skip it'
      return
    print item['name']
    return item

I am using Scrapy to crawl some pages, but I don't want it to follow the links inside the detail/xxx-xxx pages. How can I disable that?

I already added follow=False, but it doesn't work: the spider still follows the links inside detail/xxx-xxx.

======Note======

I still need to crawl the detail pages from the list pages; I just don't want any links inside a detail page to be followed to further detail pages.


1 Answer

class scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)

deny (a regular expression, or list of) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It takes precedence over the allow parameter. If not given (or empty), it won't exclude any links.
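For example, a minimal sketch reusing the patterns from your question (same old scrapy.contrib API): because deny takes precedence over allow, any link whose URL matches a deny pattern is dropped by that extractor, no matter which page it was found on.

from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# A sketch, not a tested fix: this extractor still follows list pages,
# but never extracts detail links, since deny wins over allow.
rule = Rule(SgmlLinkExtractor(
    allow=(r'list/\d+$',),
    deny=(r'detail/\d+-\d+$',),  # URLs matching this pattern are never extracted
))

Note that deny filters the extracted URLs themselves, not the pages they come from, so a pattern listed there is excluded from every page that extractor processes.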

I hope this helps.

answered 2013-08-15T09:32:52.103