I'm having a problem where my CrawlSpider isn't crawling the whole site. I'm trying to scrape a news site; it collects about 5,900 items and then exits with reason "finished", but there are large date gaps in the scraped items. I'm not using any custom middleware or settings. Thanks for any help!
My spider (please excuse the messy list-comprehension code at the bottom) and the last few lines of the log file afterwards:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from news.items import NewsItem
import re

class CrawlSpider(CrawlSpider):
    name = 'crawl'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/portal//']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'news/pages/.*|[Gg]et[Pp]age/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        p = re.compile(r"(%\d.+)|(var LEO).*|(createInline).*|(<.*?>|\r+|\n+|\s{2,}|\t|[\'])|(\xa0+|\xe2+|\x80+|\\x9.+)")
        hxs = HtmlXPathSelector(response)
        i = NewsItem()
        i['headline'] = hxs.select('//p[@class = "detailedArticleTitle"]/text()').extract()[0].strip().encode("utf-8")
        i['date'] = hxs.select('//div[@id = "DateTime"]/text()').re('\d+/\d+/[12][09]\d\d')[0].encode("utf-8")
        text = [graf.strip().encode("utf-8") for graf in hxs.select('//div[@id = "article"]//div[@style = "LINE-HEIGHT: 100%"]|//div[@id = "article"]//p//text()').extract()]
        text2 = ' '.join(text)
        text3 = re.sub("'", ' ', p.sub(' ', text2))
        i['text'] = re.sub('"', ' ', text3)
        return i
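One thing I noticed: the stats below report 131 `spider_exceptions/IndexError`, and both `extract()[0]` lookups in `parse_item` raise `IndexError` whenever an XPath matches nothing, so those pages are silently dropped. A small guard helper like this (just a sketch; `first` is a hypothetical helper, not part of Scrapy) would at least make those pages visible in the log:

```python
def first(values, default=None):
    # Return the first extracted value, or a default instead of
    # raising IndexError on pages where the XPath matches nothing.
    return values[0] if values else default

# In parse_item, instead of indexing with [0]:
#   headline = first(hxs.select('//p[@class = "detailedArticleTitle"]/text()').extract())
#   if headline is None:
#       self.log("no headline found on %s" % response.url)
```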
Log output:
2012-04-19 11:13:57-0700 [crawl] INFO: Closing spider (finished)
2012-04-19 11:13:57-0700 [crawl] INFO: Stored csv feed (5949 items) in: news.csv
2012-04-19 11:13:57-0700 [crawl] INFO: Dumping spider stats:
{'downloader/exception_count': 2,
'downloader/exception_type_count/twisted.internet.error.ConnectionLost': 2,
'downloader/request_bytes': 5778930,
'downloader/request_count': 12380,
'downloader/request_method_count/GET': 12380,
'downloader/response_bytes': 635795595,
'downloader/response_count': 12378,
'downloader/response_status_count/200': 6081,
'downloader/response_status_count/302': 6062,
'downloader/response_status_count/400': 234,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 4, 19, 18, 13, 57, 343594),
'item_scraped_count': 5949,
'request_depth_max': 23,
'scheduler/disk_enqueued': 12380,
'spider_exceptions/IndexError': 131,
'start_time': datetime.datetime(2012, 4, 19, 17, 16, 40, 75935)}
2012-04-19 11:13:57-0700 [crawl] INFO: Spider closed (finished)
2012-04-19 11:13:57-0700 [scrapy] INFO: Dumping global stats:
{}
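To quantify the date gaps I mentioned, I've been checking the scraped dates roughly like this (a sketch; it assumes the `M/D/YYYY` format my regex captures and the `news.csv` feed from the log):

```python
from datetime import datetime

def find_date_gaps(date_strings, min_gap_days=7):
    # Parse M/D/YYYY strings, sort chronologically, and return
    # (start, end) pairs where consecutive dates are more than
    # min_gap_days apart -- i.e. stretches with no scraped items.
    dates = sorted(datetime.strptime(s, "%m/%d/%Y") for s in date_strings)
    return [(a, b) for a, b in zip(dates, dates[1:])
            if (b - a).days > min_gap_days]

# With the feed from the log, something like:
#   import csv
#   with open("news.csv") as f:
#       gaps = find_date_gaps(row["date"] for row in csv.DictReader(f))
```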