
As part of learning Scrapy, I have been trying to crawl Amazon, but I am running into a problem while scraping the data.

The output of my code is as follows:

2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155>
    {'link': [u'http://www.amazon.com/ObamaCare-Survival-Guide-Nick-Tate/dp/0893348627/ref=sr_1_13?s=books&ie=UTF8&qid=1361774694&sr=1-13',
              u'http://www.amazon.com/MELT-Method-Breakthrough-Self-Treatment-Eliminate/dp/0062065351/ref=sr_1_14?s=books&ie=UTF8&qid=1361774694&sr=1-14',
              u'http://www.amazon.com/Official-SAT-Study-Guide-2nd/dp/0874478529/ref=sr_1_15?s=books&ie=UTF8&qid=1361774694&sr=1-15',
              u'http://www.amazon.com/Inferno-Robert-Langdon-Dan-Brown/dp/0385537859/ref=sr_1_16?s=books&ie=UTF8&qid=1361774694&sr=1-16',
              u'http://www.amazon.com/Memory-Light-Wheel-Time/dp/0765325950/ref=sr_1_17?s=books&ie=UTF8&qid=1361774694&sr=1-17',
              u'http://www.amazon.com/Jesus-Calling-Enjoying-Peace-Presence/dp/1591451884/ref=sr_1_18?s=books&ie=UTF8&qid=1361774694&sr=1-18',
              u'http://www.amazon.com/Fifty-Shades-Grey-Book-Trilogy/dp/0345803485/ref=sr_1_19?s=books&ie=UTF8&qid=1361774694&sr=1-19',
              u'http://www.amazon.com/Fifty-Shades-Trilogy-Darker-3-/dp/034580404X/ref=sr_1_20?s=books&ie=UTF8&qid=1361774694&sr=1-20',
              u'http://www.amazon.com/Wheat-Belly-Lose-Weight-Health/dp/1609611543/ref=sr_1_21?s=books&ie=UTF8&qid=1361774694&sr=1-21',
              u'http://www.amazon.com/Publication-Manual-American-Psychological-Association/dp/1433805618/ref=sr_1_22?s=books&ie=UTF8&qid=1361774694&sr=1-22',
              u'http://www.amazon.com/One-Only-Ivan-Katherine-Applegate/dp/0061992259/ref=sr_1_23?s=books&ie=UTF8&qid=1361774694&sr=1-23',
              u'http://www.amazon.com/Inquebrantable-Spanish-Jenni-Rivera/dp/1476745420/ref=sr_1_24?s=books&ie=UTF8&qid=1361774694&sr=1-24'],
     'title': [u'ObamaCare Survival Guide',
               u'The Official SAT Study Guide, 2nd edition',
               u'Inferno: A Novel (Robert Langdon)',
               u'A Memory of Light (Wheel of Time)',
               u'Jesus Calling: Enjoying Peace in His Presence',
               u'Fifty Shades of Grey: Book One of the Fifty Shades Trilogy',
               u'Fifty Shades Trilogy: Fifty Shades of Grey, Fifty Shades Darker, Fifty Shades Freed 3-volume Boxed Set',
               u'Wheat Belly: Lose the Wheat, Lose the Weight, and Find Your Path Back to Health',
               u'Publication Manual of the American Psychological Association, 6th Edition',
               u'The One and Only Ivan',
               u'Inquebrantable (Spanish Edition)'],
     'visit_id': '2f4d045a9d6013ef4a7cbc6ed62dc111f6111633',
     'visit_status': 'new'}

However, I want to capture the output like this instead:

2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155>
    {'link': [u'http://www.amazon.com/ObamaCare-Survival-Guide-Nick-Tate/dp/0893348627/ref=sr_1_13?s=books&ie=UTF8&qid=1361774694&sr=1-13'],
     'title': [u'ObamaCare Survival Guide']}

2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155>
    {'link': [u'http://www.amazon.com/Official-SAT-Study-Guide-2nd/dp/0874478529/ref=sr_1_15?s=books&ie=UTF8&qid=1361774694&sr=1-15'],
     'title': [u'The Official SAT Study Guide, 2nd edition']}

I don't think this is a problem with Scrapy or the crawler itself, but with how the FOR loop is written.

Here is the code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from Amaze.items import AmazeItem

class AmazeSpider2(CrawlSpider):
    name = "scanon"
    allowed_domains = ["www.amazon.com"]
    start_urls = ["http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=books"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("ref=sr_pg_*")), callback="parse_items_1", follow=True),
        )

    def parse_items_1(self, response):
        items = []
        print ('*** response:', response.url)
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//h3')
        for title in titles:
            item = AmazeItem()
            item["title"] = title.select('//a[@class="title"]/text()').extract()
            item["link"] = title.select('//a[@class="title"]/@href').extract()
            print ('**parse-items_1:', item["title"], item["link"])
            items.append(item)
        return items

Any help is appreciated!


3 Answers


The problem is in your XPath:

def parse_items_1(self, response):
    items = []
    print ('*** response:', response.url)
    hxs = HtmlXPathSelector(response)
    titles = hxs.select('//h3')
    for title in titles:
        item = AmazeItem()
        item["title"] = title.select('.//a[@class="title"]/text()').extract()
        item["link"] = title.select('.//a[@class="title"]/@href').extract()
        print ('**parse-items_1:', item["title"], item["link"])
        items.append(item)
    return items

In the XPaths above, you need the leading `.` so the expression is evaluated relative to the current `title` node. Without it, your XPath searches the entire page, so it finds every matching node and returns all of them in every item.
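The difference can be demonstrated in a few lines with lxml (the library Scrapy's selectors are built on), using made-up HTML that mimics the result listing:

```python
from lxml import html

# Made-up HTML mimicking the search-results structure the spider parses.
doc = html.fromstring("""
<html><body>
  <h3><a class="title" href="/book-1">First Book</a></h3>
  <h3><a class="title" href="/book-2">Second Book</a></h3>
</body></html>
""")

first_h3 = doc.xpath('//h3')[0]

# Absolute path: ignores the context node and searches the whole document,
# so every iteration of the loop gets ALL titles on the page.
absolute = first_h3.xpath('//a[@class="title"]/text()')
# -> ['First Book', 'Second Book']

# Relative path (leading '.'): searches only inside this <h3>.
relative = first_h3.xpath('.//a[@class="title"]/text()')
# -> ['First Book']
```

This is exactly why the original spider returned every title and link in each item: the loop variable changed, but `//a[...]` always matched against the whole page.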

Answered 2013-02-25T07:12:19.263

By the way - you can test your XPath expressions in the Scrapy shell - http://doc.scrapy.org/en/latest/topics/shell.html

Done right, it will save you hours of work and headaches. :)
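For instance, a session might look like this (a sketch assuming the old `HtmlXPathSelector` API used in the question, where the shell provides a ready-made `hxs` object):

```shell
# Fetch the page and drop into an interactive shell:
scrapy shell "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=books"

# Then, inside the shell:
# >>> hxs.select('//h3')                                            # all <h3> nodes
# >>> hxs.select('//h3')[0].select('.//a[@class="title"]/text()').extract()
```

You can tweak an expression and re-run it instantly until it returns exactly the nodes you expect, before touching the spider code.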

Answered 2013-02-25T16:41:08.457

Use `yield` to make the method a generator, and fix your XPath selectors:

def parse_items_1(self, response):
    hxs = HtmlXPathSelector(response)
    titles = hxs.select('//h3')

    for title in titles:
        item = AmazeItem()
        item["title"] = title.select('.//a[@class="title"]/text()').extract()
        item["link"] = title.select('.//a[@class="title"]/@href').extract()

        yield item
Answered 2013-02-25T07:08:01.113