I'm working on a project to scrape stats from fantasy football leagues across various services, and Yahoo is the one I'm currently stuck on. I want my spider to crawl the draft results page of a public Yahoo league. When I run the spider, it gives me no results and no error message. It just says:
2012-09-14 17:29:08-0700 [draft] DEBUG: Crawled (200) <GET http://football.fantasysports.yahoo.com/f1/753697/draftresults?drafttab=round> (referer: None)
2012-09-14 17:29:08-0700 [draft] INFO: Closing spider (finished)
2012-09-14 17:29:08-0700 [draft] INFO: Dumping spider stats:
{'downloader/request_bytes': 250,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 48785,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 9, 15, 0, 29, 8, 734000),
'scheduler/memory_enqueued': 1,
'start_time': datetime.datetime(2012, 9, 15, 0, 29, 7, 718000)}
2012-09-14 17:29:08-0700 [draft] INFO: Spider closed (finished)
2012-09-14 17:29:08-0700 [scrapy] INFO: Dumping global stats:
{}
This isn't a login issue, since the page in question is accessible without signing in. I've seen from other questions posted here that people have gotten scraping to work for other parts of Yahoo. Is it possible that Yahoo Fantasy blocks spiders? (I sketch a way to test that in the edit at the bottom.) I've already written a working spider for ESPN, so I don't think the problem is my code. Here it is, in any case:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector

# DraftItem is the Item subclass defined in this project's items.py

class DraftSpider(CrawlSpider):
    name = "draft"
    # psycopg stuff here
    rows = ["753697"]  # league IDs whose draft results to fetch
    allowed_domains = ["football.fantasysports.yahoo.com"]
    start_urls = []
    for row in rows:
        start_urls.append("http://football.fantasysports.yahoo.com/f1/%s/draftresults?drafttab=round" % row)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # one <tr> per pick in the round-by-round draft results table
        sites = hxs.select("/html/body/div/div/div/div/div/div/div/table/tr")
        items = []
        for site in sites:
            item = DraftItem()
            item['pick_number'] = site.select("td[@class='first']/text()").extract()
            item['pick_player'] = site.select("td[@class='player']/a/text()").extract()
            item['pick_nflteam'] = site.select("td[@class='player']/span/text()").extract()
            item['pick_ffteam'] = site.select("td[@class='last']/@title").extract()
            items.append(item)
        return items
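In case it's useful, here is roughly how I've been sanity-checking the selector in the Scrapy shell (the URL and XPath are the ones from the spider above; the shorter relative XPath is only a guess for comparison):

scrapy shell "http://football.fantasysports.yahoo.com/f1/753697/draftresults?drafttab=round"
>>> # the exact XPath the spider uses
>>> hxs.select("/html/body/div/div/div/div/div/div/div/table/tr")
>>> # a looser, relative XPath for comparison
>>> hxs.select("//table//tr")

My thinking is that if the relative version finds rows while the absolute one comes back empty, the long /html/body/div/... chain is the problem (e.g. the served HTML nests differently from what my browser's inspector shows) rather than anything Yahoo is doing.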
Any insight into this would be greatly appreciated.
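Edit: if the blocking theory does hold, am I right that the way to test it would be to override Scrapy's default user agent with a browser-like string via the USER_AGENT setting in the project's settings.py? Something like this (the exact string is just an example):

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20100101 Firefox/15.0'

Or is there a better way to rule blocking in or out?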