
I want to create a Scrapy script to scrape all the results for computer gigs on any craigslist subdomain, for example: http://losangeles.craigslist.org/search/cpg/. The query returns a long list of postings, and I've tried to scrape the title and href of every result (not only the ones on the first page) using CrawlSpider and LinkExtractor, but to no avail: the script returns nothing. I'll paste my script here, thanks.

    import scrapy
    from scrapy.spiders import Rule,CrawlSpider
    from scrapy.linkextractors import LinkExtractor

    class CraigspiderSpider(CrawlSpider):
        name = "CraigSpider"
        allowed_domains = ["http://losangeles.craigslist.org"]
        start_urls = (
                    'http://losangeles.craigslist.org/search/cpg/',
        )

        rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)), callback="parse_page", follow= True),)

        def parse_page(self, response):
            items = response.selector.xpath("//p[@class='row']")
        for i in items:
            link = i.xpath("./span[@class='txt']/span[@class='pl']/a/@href").extract()
            title = i.xpath("./span[@class='txt']/span[@class='pl']/a/span[@id='titletextonly']/text()").extract()
            print link,title

1 Answer


Based on the code you pasted, parse_page

  1. does not return/yield anything, and
  2. contains only a single line: "items = response.selector..."

The reason for #2 above is that the for loop is not indented properly.

Try indenting the for loop:

    import scrapy
    from scrapy.spiders import Rule, CrawlSpider
    from scrapy.linkextractors import LinkExtractor

    class CraigspiderSpider(CrawlSpider):
        name = "CraigSpider"
        # allowed_domains takes bare domain names, not URLs; with the
        # scheme included, every request would be filtered as off-site.
        allowed_domains = ["losangeles.craigslist.org"]
        start_urls = ('http://losangeles.craigslist.org/search/cpg/',)

        # Note the trailing comma: rules must be an iterable of Rule objects.
        rules = (Rule(
            LinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)),
            callback="parse_page", follow=True),)

        def parse_page(self, response):
            items = response.selector.xpath("//p[@class='row']")

            for i in items:
                link = i.xpath("./span[@class='txt']/span[@class='pl']/a/@href").extract()
                title = i.xpath("./span[@class='txt']/span[@class='pl']/a/span[@id='titletextonly']/text()").extract()
                print link, title
                yield dict(link=link, title=title)
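
One more detail worth flagging, since the goal is to capture every result: with CrawlSpider, a rule callback only runs on responses reached through links the rules extract, so the listings on the very first page never reach parse_page. A minimal sketch of the usual workaround is below; parse_start_url is a standard CrawlSpider hook, and delegating it to parse_page here is my suggestion rather than part of the original answer:

    def parse_start_url(self, response):
        # CrawlSpider calls this for each response from start_urls;
        # reuse parse_page so page 1 is scraped as well.
        return self.parse_page(response)

With that in place you can run the spider directly, e.g. scrapy runspider yourfile.py, and the yielded dicts will appear in the crawl log (or in a file if you pass -o items.json).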
Answered 2016-03-12T16:47:30.053