
I need to crawl a website and scrape every URL found at a specific XPath. For example: I need to crawl "http://someurl.com/world/", which has 10 links inside a container (xpath("//div[@class='pane-content']")). I need to follow all 10 of those links and extract the images from them, but the links on "http://someurl.com/world/" look like "http://someurl.com/node/xxxx".

What I have so far:

import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from imgur.items import ImgurItem

class ImgurSpider(CrawlSpider):
    name = 'imgur'
    allowed_domains = ['someurl.com']  # domain only, no trailing slash or path
    start_urls = ['http://someurl.com/news']
    rules = [Rule(LinkExtractor(allow=('/node/.*')), callback='parse_imgur', follow=True)]

    def parse_imgur(self, response):
        image = ImgurItem()
        image['title'] = response.xpath(
            "//h1[@class='pane-content']/a/text()").extract()
        image['image_urls'] = response.xpath("//img/@src").extract()
        return image

1 Answer


You can rewrite your `rules` to cover all of your requirements:

rules = [Rule(LinkExtractor(allow=('/node/.*',), restrict_xpaths=('//div[@class="pane-content"]',)), callback='parse_imgur', follow=True)]

To download the images from the extracted image links, you can use the ImagesPipeline bundled with Scrapy.
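Enabling the pipeline is a settings change; a minimal sketch (the storage path is a placeholder, and your item class must declare both an `image_urls` field, which you already populate, and an `images` field, which the pipeline fills with the download results):

```python
# settings.py -- enable the bundled images pipeline.
# In Scrapy >= 1.0 the path is 'scrapy.pipelines.images.ImagesPipeline';
# older versions use 'scrapy.contrib.pipeline.images.ImagesPipeline'.
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/image/store'  # placeholder local directory
```

The pipeline reads the URLs in `image_urls`, downloads each one into `IMAGES_STORE`, and records the saved paths and checksums in the item's `images` field.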

Answered 2015-10-25T19:08:53.720