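I am having a problem with Scrapy: for some reason it never enters my parse method, and I cannot figure out why. I have tried several alternatives without success.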

This is what my code looks like right now. Specifically, it has two print statements, and the one inside the parse() method is never called.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy import log
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from comments.items import CustomerReview
import re

class AppidSpider(BaseSpider):

    name = "appid"
    allowed_domains = ["itunes.apple.com"]
    start_urls = [
        "http://itunes.apple.com/us/genre/ios/id36?mt=8"
    ]

    rules = [Rule(SgmlLinkExtractor(), follow=True, callback='parse')]
    print "---> THIS IS TEST 1"

    def parse(self, response):
        print " ----> THIS IS TEST 2"
        # ... More code afterwards

This is the output. As you can see, TEST 2 is never printed.

$ scrapy crawl appid
2012-07-05 13:41:02+0000 [scrapy] INFO: Scrapy 0.14.4 started (bot: comments)
2012-07-05 13:41:02+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
---> THIS IS TEST 1
2012-07-05 13:41:02+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-05 13:41:02+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-05 13:41:02+0000 [scrapy] DEBUG: Enabled item pipelines: FilterWordsPipeline
2012-07-05 13:41:02+0000 [appid] INFO: Spider opened
2012-07-05 13:41:02+0000 [appid] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-05 13:41:02+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-05 13:41:02+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-07-05 13:41:02+0000 [appid] DEBUG: Crawled (200) <GET http://itunes.apple.com/us/genre/ios/id36?mt=8> (referer: None)
2012-07-05 13:41:02+0000 [appid] INFO: Closing spider (finished)
2012-07-05 13:41:02+0000 [appid] INFO: Dumping spider stats:
        {'downloader/request_bytes': 222,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 9927,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2012, 7, 5, 13, 41, 2, 694678),
         'scheduler/memory_enqueued': 1,
         'start_time': datetime.datetime(2012, 7, 5, 13, 41, 2, 604025)}
2012-07-05 13:41:02+0000 [appid] INFO: Spider closed (finished)
2012-07-05 13:41:02+0000 [scrapy] INFO: Dumping global stats:
        {'memusage/max': 95318016, 'memusage/startup': 95318016}

2 Answers

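As Creshal said, the callback should point to one of your own custom methods rather than to parse.

In this case, though, that should not be the problem: the spider subclasses BaseSpider, which ignores the rules attribute (only CrawlSpider processes it), so no rules are actually being followed and the callback name does not matter.

I tried your code and it works fine for me; it prints both messages.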
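
To illustrate the point about callbacks, here is a toy model in plain Python. It is not Scrapy's implementation: ToyBaseSpider, ToyCrawlSpider, handle() and the (substring, callback name) rule format are hypothetical stand-ins. It only mirrors two behaviors: a plain spider sends every response to parse() and never consults rules, while a rule-driven spider relies on its own parse() to apply the rules, which is why the Scrapy documentation warns against using parse as a rule callback in a CrawlSpider.

```python
class ToyBaseSpider(object):
    """Hypothetical stand-in for a plain spider: rules are never consulted."""

    def handle(self, response):
        # Every response goes straight to parse().
        return self.parse(response)

    def parse(self, response):
        raise NotImplementedError


class ToyCrawlSpider(ToyBaseSpider):
    """Hypothetical stand-in for a rule-driven spider."""

    rules = []  # list of (substring_to_match, callback_name) pairs

    def parse(self, response):
        # parse() itself applies the rules and dispatches to the callbacks.
        results = []
        for pattern, callback_name in self.rules:
            if pattern in response:
                results.append(getattr(self, callback_name)(response))
        return results


class PlainSpider(ToyBaseSpider):
    """Mirrors the spider in the question: rules are set but unused."""

    rules = [('app', 'parse')]

    def parse(self, response):
        return 'parse called for ' + response


class GoodSpider(ToyCrawlSpider):
    """Rule callback is a custom method, so the rule logic stays intact."""

    rules = [('app', 'parse_item')]

    def parse_item(self, response):
        return 'item from ' + response


class OverridingSpider(ToyCrawlSpider):
    """Overriding parse() replaces the rule-applying logic entirely."""

    rules = [('app', 'parse_item')]

    def parse(self, response):
        return 'custom parse'

    def parse_item(self, response):
        return 'item from ' + response
```

In this model PlainSpider().handle('app page') reaches parse() directly, GoodSpider dispatches to parse_item through its rule, and OverridingSpider never reaches parse_item because its own parse() has replaced the rule handling.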

Answered 2012-07-05T14:02:57.610

Why are you passing parse as a string? Try callback=self.parse instead.
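
One caveat, sketched in plain Python: rules is assigned in the class body, where no instance exists yet, so the name self is not in scope there. A string name avoids this because it can be resolved on the instance later with getattr, which matches how the Scrapy documentation describes string callbacks for Rule. The names define_rules_with_self, Broken, Deferred and run below are hypothetical and exist only for this illustration.

```python
def define_rules_with_self():
    """Try to reference self.parse_item inside a class body."""
    try:
        class Broken(object):
            # No instance exists while the class body runs,
            # so the name "self" cannot be resolved here.
            callback = self.parse_item
    except NameError as exc:
        return str(exc)
    return None


class Deferred(object):
    # A string is just data at class-definition time.
    callback = 'parse_item'

    def parse_item(self, response):
        return 'parsed ' + response

    def run(self, response):
        # The method is looked up on the instance only when it is needed.
        return getattr(self, self.callback)(response)
```

Calling define_rules_with_self() returns the NameError message instead of defining the class, while Deferred().run('page') resolves the string to the bound method and calls it.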

Answered 2012-07-05T13:57:34.037