I'm new to Python and Scrapy, and I wrote a simple script to crawl posts from my school's BBS. However, when my spider runs, it gets error messages like the following:
2015-03-28 11:16:52+0800 [nju_spider] DEBUG: Retrying http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427299332.A> (failed 2 times): [> ]
2015-03-28 11:16:52+0800 [nju_spider] DEBUG: Gave up retrying http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427281812.A> (failed 3 times): [> ]
2015-03-28 11:16:52+0800 [nju_spider] ERROR: Error downloading http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427281812.A>: [> ]
2015-03-28 11:16:56+0800 [nju_spider] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 99,
     'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 99,
     'downloader/request_bytes': 36236,
     'downloader/request_count': 113,
     'downloader/request_method_count/GET': 113,
     'downloader/response_bytes': 31135,
     'downloader/response_count': 14,
     'downloader/response_status_count/200': 14,
     'dupefilter/filtered': 25,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 3, 28, 3, 16, 56, 677065),
     'item_scraped_count': 11,
     'log_count/DEBUG': 127,
     'log_count/ERROR': 32,
     'log_count/INFO': 8,
     'request_depth_max': 3,
     'response_received_count': 14,
     'scheduler/dequeued': 113,
     'scheduler/dequeued/memory': 113,
     'scheduler/enqueued': 113,
     'scheduler/enqueued/memory': 113,
     'start_time': datetime.datetime(2015, 3, 28, 3, 16, 41, 874807)}
2015-03-28 11:16:56+0800 [nju_spider] INFO: Spider closed (finished)
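To put a number on how bad this is, here is a small tally of the counts from the stats dump above. It is only arithmetic on the logged values; the `failure_rate` helper is a name I made up for this illustration and is not part of Scrapy.

```python
# Counts copied from the Scrapy stats dump above.
stats = {
    'downloader/request_count': 113,
    'downloader/exception_count': 99,
    'downloader/response_count': 14,
}


def failure_rate(stats):
    """Fraction of downloader requests that ended in an exception.

    Hypothetical helper for illustration only, not a Scrapy API.
    """
    return stats['downloader/exception_count'] / float(stats['downloader/request_count'])


# Every request either raised an exception or produced a response.
accounted = stats['downloader/exception_count'] + stats['downloader/response_count']
print('requests accounted for: %d of %d' % (accounted, stats['downloader/request_count']))
print('%.1f%% of requests failed' % (failure_rate(stats) * 100))
```

So nearly nine out of ten requests ended in `twisted.web._newclient.ResponseFailed`, and only 14 got a 200 response.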
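It seems the spider tried the URL and failed, but the URL does exist. There are a few thousand posts on the BBS, yet every time I run my spider it only fetches a handful of them, seemingly at random. My code is shown below; any help is greatly appreciated.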
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from ScrapyTest.items import NjuPostItem


class NjuSpider(CrawlSpider):
    name = 'nju_spider'
    allowed_domains = ['bbs.nju.edu.cn']
    start_urls = ['http://bbs.nju.edu.cn/bbstdoc?board=WarAndPeace']

    rules = [
        # Individual post pages: hand each one to parse_post.
        Rule(LinkExtractor(allow=[r'bbstcon\?board=WarAndPeace&file=M\.\d+\.A']),
             callback='parse_post'),
        # Board listing pages: follow them to discover more posts.
        Rule(LinkExtractor(allow=[r'bbstdoc\?board=WarAndPeace&start=\d+']),
             follow=True),
    ]

    def parse_post(self, response):
        # self.log('A response from %s just arrived!' % response.url)
        post = NjuPostItem()
        post['url'] = response.url
        post['title'] = 'to_do'
        post['content'] = 'to_do'
        return post
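
As a sanity check, here is a small standalone sketch that tests the `allow` pattern from my rules against the two URLs in the log, using only the standard `re` module. It assumes the link extractor applies each pattern as a regex search against the full URL, which is my understanding but not something I have verified in the Scrapy source. The `matches` helper is a name I made up for this check.

```python
import re

# Pattern copied from the first Rule in the spider above.
POST_PATTERN = r'bbstcon\?board=WarAndPeace&file=M\.\d+\.A'

# URLs copied from the log output above.
failed_urls = [
    'http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427299332.A',
    'http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427281812.A',
]


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL.

    Assumption: this mirrors how the 'allow' patterns are applied.
    """
    return re.search(pattern, url) is not None


for url in failed_urls:
    print(url, matches(POST_PATTERN, url))
```

Both URLs match the pattern, so the links are being extracted as intended and the failure seems to happen later, at download time.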