
I started testing Scrapy to crawl a website, but when I run my code I get an error that I can't figure out how to fix.

Here is the error output:

...
2012-12-18 02:07:19+0000 [dmoz] DEBUG: Crawled (200) <GET http://MYURL.COM> (referer: None)
2012-12-18 02:07:19+0000 [dmoz] ERROR: Spider error processing <GET http://MYURL.COM>
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 1178, in mainLoop
        self.runUntilCurrent()
      File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 800, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 368, in callback
        self._startRunCallbacks(result)
      File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 464, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 551, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.16.3-py2.7.egg/scrapy/spider.py", line 57, in parse
        raise NotImplementedError
    exceptions.NotImplementedError: 

2012-12-18 02:07:19+0000 [dmoz] INFO: Closing spider (finished)
2012-12-18 02:07:19+0000 [dmoz] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 357,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 20704,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 12, 18, 2, 7, 19, 595977),
     'log_count/DEBUG': 7,
     'log_count/ERROR': 1,
     'log_count/INFO': 4,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'spider_exceptions/NotImplementedError': 1,
     'start_time': datetime.datetime(2012, 12, 18, 2, 7, 18, 836322)}

It looks like it may be related to my parse function and the callback. I tried removing the rule, but then it only works for a single URL, and what I need is to crawl the whole site.

Here is my code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item


class DmozSpider(BaseSpider):
    name = "dmoz"
    start_urls = ["http://MYURL.COM"]
    rules = (Rule(SgmlLinkExtractor(allow_domains=('http://MYURL.COM', )), callback='parse_l', follow=True),)


    def parse_l(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//div[@class='content']")
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select("//div[@class='gig-title-g']/h1").extract()
            item['link'] = site.select("//ul[@class='gig-stats prime']/li[@class='queue ']/div[@class='big-txt']").extract()
            item['desc'] = site.select("//li[@class='thumbs'][1]/div[@class='gig-stats-numbers']/span").extract()
            items.append(item)
        return items

Any pointers in the right direction would be greatly appreciated.

Thanks a lot!


1 Answer


Found the answer to this problem:

Why is scrapy throwing an error for me when trying to crawl and parse a site?

It turns out that BaseSpider does not implement Rule.

If you stumbled upon this question and you are crawling with BaseSpider, you need to change it to CrawlSpider and import it, as described at http://doc.scrapy.org/en/latest/topics/spiders.html

from scrapy.contrib.spiders import CrawlSpider, Rule
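To see why the traceback ends in `NotImplementedError`, the mechanism can be reproduced with a minimal, Scrapy-free sketch (the `Fake*` classes below are hypothetical stand-ins, not Scrapy's real implementation): the base spider's default `parse()` is a stub that raises, and `rules` are only consumed by the crawl-spider subclass, so a base-spider subclass that defines only a custom callback like `parse_l()` still falls through to the stub for every downloaded response.

```python
class FakeBaseSpider:
    """Stand-in for BaseSpider: parse() must be overridden by the user."""
    def parse(self, response):
        raise NotImplementedError


class FakeCrawlSpider(FakeBaseSpider):
    """Stand-in for CrawlSpider: it overrides parse() itself and
    dispatches each response to the callbacks named in self.rules."""
    rules = ()

    def parse(self, response):
        for callback_name in self.rules:
            yield getattr(self, callback_name)(response)


class BrokenSpider(FakeBaseSpider):
    # As in the question: only a custom callback is defined, so the
    # engine's call to parse() hits the base-class stub and raises.
    def parse_l(self, response):
        return "item"


class FixedSpider(FakeCrawlSpider):
    rules = ("parse_l",)

    def parse_l(self, response):
        return "item"


try:
    BrokenSpider().parse("response")
    outcome = "no error"
except NotImplementedError:
    outcome = "NotImplementedError"

print(outcome)                                # NotImplementedError
print(list(FixedSpider().parse("response")))  # ['item']
```

The same shape applies to the real classes: switching the base class from `BaseSpider` to `CrawlSpider` gives you a `parse()` implementation that routes responses through `rules` to `parse_l` (which is also why CrawlSpider's docs warn you not to name your own callback `parse`).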
Answered 2012-12-18T03:54:22.300