我在 Windows Vista 64 位上使用 Python.org 版本 2.7、64 位。我正在构建一个递归 webscraper,它在仅从单个页面中提取文本时似乎工作,但是在抓取多个页面时似乎没有工作。代码如下:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
class ExampleSpider(CrawlSpider):
name = "goal3"
allowed_domains = ["whoscored.com"]
start_urls = ["http://www.whoscored.com"]
download_delay = 1
rules = [Rule(SgmlLinkExtractor(allow=()),
follow=True),
Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
]
def parse_item(self,response):
self.log('A response from %s just arrived!' % response.url)
scripts = response.selector.xpath("normalize-space(//title)")
for scripts in scripts:
body = response.xpath('//p').extract()
body2 = "".join(body)
print remove_tags(body2).encode('utf-8')
execute(['scrapy','crawl','goal3'])
我从中获得的输出示例如下:
2014-07-25 19:31:32+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Players/133260/Show/Michael-Ngoo> (referer: http://www.whoscored.com/Players/14170/Show/Ishmael-Miller)
2014-07-25 19:31:33+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/160/Show/England-Charlton> (referer: http://www.whoscored.com/Players/10794/Show/Rafik-Djebbour)
2014-07-25 19:31:33+0100 [goal3] DEBUG: Filtered offsite request to 'www.cafc.co.uk': <GET http://www.cafc.co.uk/page/Home>
2014-07-25 19:31:34+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Matches/721465/Live/England-Championship-2013-2014-Nottingham-Forest-Charlton> (referer: http://www.whoscored.com/Players/10794/Show/Rafik-Djebbour)
2014-07-25 19:31:36+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/126/News> (referer: http://www.whoscored.com/Teams/1426/News)
2014-07-25 19:31:36+0100 [goal3] DEBUG: Filtered offsite request to 'www.fcsochaux.fr': <GET http://www.fcsochaux.fr/fr/index.php?lng=fr>
2014-07-25 19:31:37+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/976/News> (referer: http://www.whoscored.com/Teams/1426/News)
2014-07-25 19:31:37+0100 [goal3] DEBUG: Filtered offsite request to 'www.grenoblefoot38.fr': <GET http://www.grenoblefoot38.fr/>
2014-07-25 19:31:37+0100 [goal3] DEBUG: Filtered offsite request to 'www.as.com': <GET http://www.as.com/futbol/articulo/leones-ponen-manos-obra-grenoble/20120713dasdaiftb_52/Tes>
2014-07-25 19:31:38+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/56/News> (referer: http://www.whoscored.com/Teams/53/News)
2014-07-25 19:31:38+0100 [goal3] DEBUG: Filtered offsite request to 'www.realracingclub.es': <GET http://www.realracingclub.es/default.aspx>
2014-07-25 19:31:39+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/125/News> (referer: http://www.whoscored.com/Teams/146/News)
2014-07-25 19:31:39+0100 [goal3] DEBUG: Filtered offsite request to 'www.asnl.net': <GET http://www.asnl.net/pages/club/entraineurs.html>
2014-07-25 19:31:40+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/425/News> (referer: http://www.whoscored.com/Teams/24/News)
2014-07-25 19:31:40+0100 [goal3] DEBUG: Filtered offsite request to 'www.dbu.dk': <GET http://www.dbu.dk/>
2014-07-25 19:31:42+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/282/News> (referer: http://www.whoscored.com/Teams/50/News)
2014-07-25 19:31:42+0100 [goal3] DEBUG: Filtered offsite request to 'www.fc-koeln.de': <GET http://www.fc-koeln.de/index.php?id=10>
2014-07-25 19:31:43+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/58/News> (referer: http://www.whoscored.com/Teams/131/News)
2014-07-25 19:31:43+0100 [goal3] DEBUG: Filtered offsite request to 'www.realvalladolid.es': <GET http://www.realvalladolid.es/>
2014-07-25 19:31:44+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/973/News> (referer: http://www.whoscored.com/Teams/145/News)
2014-07-25 19:31:44+0100 [goal3] DEBUG: Filtered offsite request to 'www.fifci.org': <GET http://www.fifci.org/>
我可以理解外部链接超出爬虫范围而被过滤掉,但是我无法理解为什么返回的结果是“DEBUG:”消息和页面链接,尤其是在成功的情况下为所有这些结果打印 200 的 HTTP 返回代码。
谁能看到这里有什么问题?
谢谢