I'm running into a strange problem while trying to crawl a particular site. If I crawl some of its pages with a BaseSpider, the code works perfectly, but if I change the code to use a CrawlSpider, the spider finishes without any errors and yet nothing gets crawled (my callback is never called).
The following code works fine:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.loader import XPathItemLoader
from dirbot.items import Website
from urlparse import urlparse
from scrapy import log

class hushBabiesSpider(BaseSpider):
    name = "hushbabies"
    #download_delay = 10
    allowed_domains = ["hushbabies.com"]
    start_urls = [
        "http://www.hushbabies.com/category/toys-playgear-bath-bedtime.html",
        "http://www.hushbabies.com/category/mommy-newborn.html",
        "http://www.hushbabies.com"
    ]

    def parse(self, response):
        print response.body
        print "Inside parse Item"
        return []
The following code does not work:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.loader import XPathItemLoader
from dirbot.items import Website
from urlparse import urlparse
from scrapy import log

class hushBabiesSpider(CrawlSpider):
    name = "hushbabies"
    #download_delay = 10
    allowed_domains = ["hushbabies.com"]
    start_urls = [
        "http://www.hushbabies.com/category/toys-playgear-bath-bedtime.html",
        "http://www.hushbabies.com/category/mommy-newborn.html",
        "http://www.hushbabies.com"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=()),
             'parseItem',
             follow=True,
        ),
    )

    def parseItem(self, response):
        print response.body
        print "Inside parse Item"
        return []
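
Since CrawlSpider only calls the rule callback on links that its link extractor pulls out of the downloaded responses, one way to narrow this down would be to run the same extractor by hand in the Scrapy shell and see whether it finds any links at all on a start page (a rough check only, using the same extractor settings as the rule above):

scrapy shell http://www.hushbabies.com
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> len(SgmlLinkExtractor(allow=()).extract_links(response))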
The output of the Scrapy run is as follows:
scrapy crawl hushbabies
2012-07-23 18:50:37+0000 [scrapy] INFO: Scrapy 0.15.1-198-g831a450 started (bot: SKBot)
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, WebService, CoreStats, MemoryUsage, SpiderState, CloseSpider
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled downloader middlewares: RobotsTxtMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled item pipelines: SQLStorePipeline
2012-07-23 18:50:37+0000 [hushbabies] INFO: Spider opened
2012-07-23 18:50:37+0000 [hushbabies] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-07-23 18:50:37+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com/robots.txt> (referer: None)
2012-07-23 18:50:39+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com> (referer: None)
2012-07-23 18:50:39+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com/category/mommy-newborn.html> (referer: None)
2012-07-23 18:50:39+0000 [hushbabies] INFO: Closing spider (finished)
2012-07-23 18:50:39+0000 [hushbabies] INFO: Dumping spider stats:
{'downloader/request_bytes': 634,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 44395,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 7, 23, 18, 50, 39, 674965),
'scheduler/memory_enqueued': 2,
'start_time': datetime.datetime(2012, 7, 23, 18, 50, 37, 700711)}
2012-07-23 18:50:39+0000 [hushbabies] INFO: Spider closed (finished)
2012-07-23 18:50:39+0000 [scrapy] INFO: Dumping global stats:
{'memusage/max': 27820032, 'memusage/startup': 27652096}
Changing the site from hushbabies.com to some other site makes the code work fine.
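
That makes me suspect the problem is in how SgmlLinkExtractor handles this particular site's HTML rather than in the spider wiring itself. A minimal standalone sketch to test that (a hypothetical script, assuming the pages download the same way outside Scrapy):

import urllib2

from scrapy.http import HtmlResponse
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

def count_links(url):
    # Fetch the raw HTML and run the same extractor the rule uses over it.
    body = urllib2.urlopen(url).read()
    response = HtmlResponse(url=url, body=body)
    return len(SgmlLinkExtractor(allow=()).extract_links(response))

for url in ["http://www.hushbabies.com",
            "http://www.hushbabies.com/category/mommy-newborn.html"]:
    print url, count_links(url)

# If the counts are 0 here but non-zero for a site that crawls fine,
# the extractor (not the CrawlSpider itself) is what fails on this HTML.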