2

所以我正在构建这个蜘蛛,它爬得很好,因为我可以登录到 shell 并浏览 HTML 页面并测试我的 Xpath 查询。

不知道我做错了什么。任何帮助,将不胜感激。我已经重新安装了 Twisted,但没有。

我的蜘蛛长这样——

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from spider_scrap.items import spiderItem

class spider(BaseSpider):
name="spider1"
#allowed_domains = ["example.com"]
start_urls = [                  
              "http://www.example.com"
            ]

def parse(self, response):
 items = [] 
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//*[@id="search_results"]/div[1]/div')

    for site in sites:
        item = spiderItem()
        item['title'] = site.select('div[2]/h2/a/text()').extract                            item['author'] = site.select('div[2]/span/a/text()').extract    
        item['price'] = site.select('div[3]/div[1]/div[1]/div/b/text()').extract()     
    items.append(item)
    return items

当我运行蜘蛛时 - scrapy crawl Spider1 我收到以下错误 -

    2012-09-25 17:56:12-0400 [scrapy] DEBUG: Enabled item pipelines:
    2012-09-25 17:56:12-0400 [Spider1] INFO: Spider opened
    2012-09-25 17:56:12-0400 [Spider1] INFO: Crawled 0 pages (at 0 pages/min), scraped  0 items (at 0 items/min)
    2012-09-25 17:56:12-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2012-09-25 17:56:12-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
   2012-09-25 17:56:15-0400 [Spider1] DEBUG: Crawled (200) <GET http://www.example.com>  (refere
   r: None)
   2012-09-25 17:56:15-0400 [Spider1] ERROR: Spider error processing <GET    http://www.example.com
    s>
    Traceback (most recent call last):
      File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1178, in mainLoop
        self.runUntilCurrent()
      File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 800, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 368, in callback
        self._startRunCallbacks(result)
      File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 464, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 551, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "C:\Python27\lib\site-packages\scrapy\spider.py", line 62, in parse
        raise NotImplementedError
    exceptions.NotImplementedError:

     2012-09-25 17:56:15-0400 [Spider1] INFO: Closing spider (finished)
     2012-09-25 17:56:15-0400 [Spider1] INFO: Dumping spider stats:
    {'downloader/request_bytes': 231,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 186965,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 9, 25, 21, 56, 15, 326000),
     'scheduler/memory_enqueued': 1,
     'spider_exceptions/NotImplementedError': 1,
     'start_time': datetime.datetime(2012, 9, 25, 21, 56, 12, 157000)}
      2012-09-25 17:56:15-0400 [Spider1] INFO: Spider closed (finished)
      2012-09-25 17:56:15-0400 [scrapy] INFO: Dumping global stats:
    {}
4

3 回答 3

12

For everyone who faces this problem, please, make sure you didn't rename parse() method like I did:

class CakeSpider(CrawlSpider):
    name            = "cakes"
    allowed_domains = ["cakes.com"]
    start_urls      = ["http://www.cakes.com/catalog"]

    def parse(self, response): #this should be 'parse' and nothing else

        #yourcode#

Otherwise it throws the same error:

...
File "C:\Python27\lib\site-packages\scrapy\spider.py", line 62, in parse
    raise NotImplementedError
    exceptions.NotImplementedError:

I've spent like three hours trying to figure out -.-

于 2014-08-22T11:13:56.780 回答
3

Leo 是对的,缩进是不正确的。您可能在脚本中混合了一些制表符和空格,因为您粘贴了一些代码并自己输入了其他代码,并且您的编辑器允许在同一文件中同时使用制表符和空格。将所有制表符转换为空格,这样更像:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from spider_scrap.items import spiderItem

class spider(BaseSpider):
    name = "spider1"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        items = []
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//*[@id="search_results"]/div[1]/div')

        for site in sites:
            item = spiderItem()
            item['title'] = site.select('div[2]/h2/a/text()').extract
            item['author'] = site.select('div[2]/span/a/text()').extract
            item['price'] = site.select('div[3]/div[1]/div[1]/div/b/text()').extract()
            items.append(item)

        return items
于 2012-09-26T15:14:45.870 回答
2

您的 parse 方法不在类代码中,请使用下面提到的代码

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from spider_scrap.items import spiderItem

class spider(BaseSpider):
    name="spider1"
    allowed_domains = ["example.com"]
    start_urls = [
      "http://www.example.com"
     ]

    def parse(self, response):
        items = []
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//*[@id="search_results"]/div[1]/div')

        for site in sites:
            item = spiderItem()
            item['title'] = site.select('div[2]/h2/a/text()').extract                            item['author'] = site.select('div[2]/span/a/text()').extract
            item['price'] = site.select('div[3]/div[1]/div[1]/div/b/text()').extract()
        items.append(item)
        return items
于 2013-03-21T18:44:22.757 回答