from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from aibang.items import OrgItem

class OrgSpider(CrawlSpider):
  name = "org"
  allowed_domains = ["demo-site.com"]
  start_urls = [
      'http://demo-site.com/detail/17507640-419823665'
  ]

  rules = (
      # Item list pages: follow their links (no callback, so follow defaults to True)
      Rule(SgmlLinkExtractor(allow=(r'list/\d+$',))),
      # Item detail pages: parse them, but do not follow the links found inside
      Rule(SgmlLinkExtractor(allow=(r'detail/\d+-\d+$',)), callback='parse_item', follow=False),
  )
  
  def parse_item(self, response):
    hxs = HtmlXPathSelector(response)

    item = OrgItem()
    try:
      item['name'] = hxs.select('//div[@class="b_title"]/h1/text()')[0].extract()
    except IndexError:
      # The title element is missing, so skip this page
      print 'Something went wrong, skip it'
      return
    print item['name']
    return item

I am using Scrapy to crawl some pages, but I don't want it to follow the links inside the detail/xxx-xxx pages. How can I disable that?

I already added follow=False, but it doesn't work: the spider still follows the links inside detail/xxx-xxx.

======Note======

I still need to crawl the detail pages from the list pages; I just don't want any links inside a detail page to be followed to further detail pages.


1 Answer

class scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)

deny (a regular expression, or list of) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It takes precedence over the allow parameter. If not given (or empty), it won't exclude any links.
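For example, a minimal sketch reusing the patterns from your question (same old scrapy.contrib API): because deny takes precedence over allow, any link whose URL matches a deny pattern is dropped by that extractor, no matter which page it was found on.

from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# A sketch, not a tested fix: this extractor still follows list pages,
# but never extracts detail links, since deny wins over allow.
rule = Rule(SgmlLinkExtractor(
    allow=(r'list/\d+$',),
    deny=(r'detail/\d+-\d+$',),  # URLs matching this pattern are never extracted
))

Note that deny filters the extracted URLs themselves, not the pages they come from, so a pattern listed there is excluded from every page that extractor processes.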

I hope this helps.

answered 2013-08-15T09:32:52.103