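The same code scrapes Yellow Pages without any problem and behaves as expected. When I change the rule to target Craigslist (CL), it hits the first URL and then flakes out with no relevant output.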
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigs.items import CraigsItem


class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craiglist.org"]
    start_urls = ["http://newyork.craigslist.org/cpg/"]
    # Extract links from the anchors matched by the XPath and pass each
    # fetched page to parse_profile.
    rules = [
        Rule(
            SgmlLinkExtractor(restrict_xpaths=('/html/body/blockquote[3]/p/a',)),
            follow=True,
            callback='parse_profile',
        )
    ]

    def parse_profile(self, response):
        found = []
        img = CraigsItem()
        hxs = HtmlXPathSelector(response)
        # Fill the item fields from the posting page.
        img['title'] = hxs.select('//h2[contains(@class, "postingtitle")]/text()').extract()
        img['text'] = hxs.select('//section[contains(@id, "postingbody")]/text()').extract()
        img['tags'] = hxs.select('//html/body/article/section/section[2]/section[2]/ul/li[1]').extract()
        print found[0]
        return found[0]
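Here is the output: http://pastie.org/6087878. As you can see, it has no problem getting the first URL to crawl, http://newyork.craigslist.org/mnh/cpg/3600242403.html, but then it dies.
From the CLI I can extract the links, either with XPaths or with a keyword pattern: SgmlLinkExtractor(allow=r' /cpg/.+').extract_links(response)
Output -> http://pastie.org/6085322
But in the crawl, the same query fails. What is going on??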
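One thing that may be worth checking (a guess, not a confirmed diagnosis): allowed_domains is spelled "craiglist.org", while start_urls points at craigslist.org. If off-site filtering compares each extracted link against allowed_domains, every posting link would be dropped before parse_profile is called, which would match a crawl that stops right after the first URL. The sketch below is a simplified, standard-library stand-in for that kind of domain check. It is not Scrapy's own code, and is_allowed is a hypothetical helper made up for illustration.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals, or is a subdomain of, an allowed domain.

    Hypothetical helper for illustration only; a simplified stand-in for an
    off-site filter, not Scrapy's implementation.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


link = "http://newyork.craigslist.org/mnh/cpg/3600242403.html"

# Spelling used in the spider above: "craiglist.org" (no "s" after "craig").
print(is_allowed(link, ["craiglist.org"]))

# Spelling that matches the host in start_urls.
print(is_allowed(link, ["craigslist.org"]))
```

Under this sketch the first call prints False and the second prints True. Separately, found is never appended to in parse_profile, so found[0] raises IndexError whenever the callback does run; returning img would be the likely intent.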