
I have the following spider for Scrapy. I need to scrape not only the top level pages in my sitemap but also the pages that are 1st-level children of those pages. Then I need to concatenate the results of the children's scrape with the body item from my parent parse method. Could anyone help me with the code to do something like this?

from scrapy.contrib.spiders import SitemapSpider
from scrapy.selector import HtmlXPathSelector
from cvorgs.items import CvorgSite

class CvorgSpider(SitemapSpider):
    name = 'cvorg_spider'
    sitemap_urls = ["http://www.urbanministry.org/cvorg_urls.xml"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = CvorgSite()
        item['url'] = response.url
        item['title'] = hxs.select('//title/text()').extract()
        item['meta'] = hxs.select('/html/head/meta[@name="description"]/@content').extract()
        body = ' '.join(hxs.select('//body//p//text()').extract())
        item['body'] = body.replace('"', '\'')
        return item

1 Answer


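OK, so you need to extract the child URLs from each parent page and crawl them as well. To do that, yield a Request for each child URL from your parse method. In the example from my own spider, I extract each child link, build a complete URL from it, and request that URL. Here callback=self.parse_category_tilte names the method that receives the response for the URL returned by complete_url(link):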

from scrapy.http import Request

# Inside the parent parse method: follow every child link found on the page.
sites = hxs.select('//div[@class="left-column"]/div[@class="resultContainer"]/span/h2/a/@href')
for site in sites:
    link = site.extract()
    yield Request(complete_url(link), callback=self.parse_category_tilte)

complete_url turns the extracted link into a complete URL:

def complete_url(string):
    """Return complete url"""
    return "http://www.timeoutdelhi.net" + string
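
As a possible variation, here is a minimal sketch of the same helper built on the standard library's urljoin, so that links which are already absolute are not prefixed a second time. It assumes Python 3 (urllib.parse); the base parameter and the BASE_URL name are additions for illustration and are not part of the original helper.

```python
from urllib.parse import urljoin

# Base URL taken from the complete_url helper above.
BASE_URL = "http://www.timeoutdelhi.net"


def complete_url(link, base=BASE_URL):
    """Return an absolute URL for a link extracted from the page."""
    # urljoin resolves relative links against base and
    # returns links that are already absolute unchanged.
    return urljoin(base, link)
```

With plain string concatenation, an href that already contains the scheme and host would produce a malformed URL; urljoin avoids that case.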

Then scrape the child page in the parse_category_tilte method:

# Inside parse_category_tilte, with hxs = HtmlXPathSelector(response):
sites = hxs.select('//div[@class="box-header"]/h3/text()')
items = []
for site in sites:
    item = OnthegoItem()
    item['ename'] = site.extract()
    items.append(item)
return items
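
The question also asks how to concatenate the child text with the parent's body. In Scrapy, the parent item can be handed to the child callback through the Request's meta dict and read back from response.meta there. The merge step itself might look like the sketch below. It uses a plain dict in place of the CvorgSite item, and the helper name append_child_body is hypothetical.

```python
def append_child_body(item, child_paragraphs):
    """Append a child page's paragraph text to the parent item's body."""
    # Same clean-up as in the question: join the text nodes, swap double quotes.
    child_body = ' '.join(child_paragraphs).replace('"', "'")
    # Skip empty children so no stray separator is added.
    if child_body:
        item['body'] = (item.get('body', '') + ' ' + child_body).strip()
    return item


# Stand-in for the item built in the parent parse method (illustrative data).
parent = {'url': 'http://example.com/parent', 'body': 'Parent text.'}
append_child_body(parent, ['Child says "hi".', 'More text.'])
print(parent['body'])
```

In a spider, the child callback would call such a helper with the item taken from response.meta and the paragraphs extracted from the child response. With several children per parent, the item should be returned only after the last child has been merged.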

Hope this helps, and an upvote would be appreciated. :)

Answered 2013-08-21T20:25:44.373