
After banging my head against this a few times, I've finally ended up here.

Question: I'm trying to download the content of every craigslist post. By content I mean the "posting body", such as the description of a cell phone. I'm looking for a new old phone, since the iPhone has lost all its excitement.

The code is based on the excellent work of Michael Herman.

My spider class:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import *
from craig.items import CraiglistSampleItem

class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://minneapolis.craigslist.org/moa/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r"index\d00\.html",), restrict_xpaths=('//p[@class="nextpage"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self,response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//span[@class='pl']")
        items = []
        for title in titles:
            item = CraiglistSampleItem()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items

And the Item class:

from scrapy.item import Item, Field

class CraiglistSampleItem(Item):
    title = Field()
    link = Field()

Since the code will crawl through many links, I'd like to save each phone's description in a separate csv, though adding one more column to a single csv would also work.

Any leads appreciated!


1 Answer


Instead of returning items from the parse_items method, you should return/yield scrapy Request instances in order to fetch the description from each item page. The title and link can be passed inside an Item, and the Item inside the request's meta dictionary:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import *

from scrapy.item import Item, Field


class CraiglistSampleItem(Item):
    title = Field()
    link = Field()
    description = Field()


class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://minneapolis.craigslist.org/moa/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r"index\d00\.html",), restrict_xpaths=('//p[@class="nextpage"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)

        titles = hxs.select("//span[@class='pl']")
        for title in titles:
            item = CraiglistSampleItem()
            item["title"] = title.select("a/text()").extract()[0]
            item["link"] = title.select("a/@href").extract()[0]

            url = "http://minneapolis.craigslist.org%s" % item["link"]
            yield Request(url=url, meta={'item': item}, callback=self.parse_item_page)

    def parse_item_page(self, response):
        hxs = HtmlXPathSelector(response)

        item = response.meta['item']
        item['description'] = hxs.select('//section[@id="postingbody"]/text()').extract()
        return item

Run it and you'll see the additional description column in the output csv file.
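As a side note, Scrapy's CSV feed exporter handles the actual file writing for you; conceptually the exported rows look like what the stdlib csv module would produce. Here is a small sketch of that output shape with the extra description field on each item (the item values are made-up examples, not real craigslist data):

```python
# Sketch of the CSV rows produced once "description" is part of the
# item, using csv.DictWriter as a stand-in for Scrapy's CSV exporter.
import csv
import io

# Hypothetical scraped items with the three fields defined on
# CraiglistSampleItem (title, link, description).
items = [
    {"title": "Nokia 3310", "link": "/moa/12345.html",
     "description": "Classic phone, great battery."},
    {"title": "Motorola RAZR", "link": "/moa/67890.html",
     "description": "Flip phone, lightly used."},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "link", "description"])
writer.writeheader()   # header row: title,link,description
writer.writerows(items)
print(buf.getvalue())
```

Each item yielded by parse_item_page becomes one such row, with description filled in from the posting body.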

Hope that helps.

Answered 2013-07-07T18:41:32.570