
My university classes start soon, so I decided to build a web scraper for RateMyProfessors to help me find the highest-rated teachers at my school. The scraper works great... but only for the second page! No matter what I try, I can't get it to work properly.

Here is the URL I'm scraping: http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=3 (not my actual university, but it has the same kind of URL structure)

Here is my spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from rmp.items import RmpItem

class MySpider(CrawlSpider):
    name = "rmp"
    allowed_domains = ["ratemyprofessors.com"]
    start_urls = ["http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311"]

    rules = (Rule(SgmlLinkExtractor(allow=('&pageNo=\d',), restrict_xpaths=('//a[@id="next"]',)), callback='parser', follow=True),)

    def parser(self, response):
        hxs = HtmlXPathSelector(response)
        html = hxs.select("//div[@class='entry odd vertical-center'] | //div[@class='entry even vertical-center']")
        profs = []
        for line in html:
            prof = RmpItem()
            prof["name"] = line.select("div[@class='profName']/a/text()").extract()
            prof["dept"] = line.select("div[@class='profDept']/text()").extract()
            prof["ratings"] = line.select("div[@class='profRatings']/      text()").extract()
            prof["avg"] = line.select("div[@class='profAvg']/text()").extract()
            profs.append(prof)
        return profs

Some things I've tried include removing the restrict_xpaths keyword argument (which made the scraper follow the first, last, next, and back buttons, since they all share the &pageNo=\d URL structure) and changing the regular expression in the allow keyword argument (no change in the results).
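
Concretely, the variants I tried looked roughly like this (reconstructed from memory, not my exact code):

# Attempt 1: drop restrict_xpaths -- the extractor now also follows the
# first/last/back buttons, since their hrefs match the same pattern
rules = (Rule(SgmlLinkExtractor(allow=('&pageNo=\d',)), callback='parser', follow=True),)

# Attempt 2: loosen the allow regex -- no change in the results
rules = (Rule(SgmlLinkExtractor(allow=('pageNo=\d+',), restrict_xpaths=('//a[@id="next"]',)), callback='parser', follow=True),)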

Does anyone have any suggestions? It seems like such a simple problem, but I've spent the past hour and a half trying to figure it out! Any help would be appreciated.


2 Answers


The site doesn't handle the page parameters well when they are not in the expected order. Look at the href values:

$ curl -q -s  "http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=2"  |grep \"next\"
    <a href="/SelectTeacher.jsp?sid=2311&pageNo=3" id="next">c</a>
$ curl -q -s  "http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311"  |grep \"next\"
    <a href="/SelectTeacher.jsp?pageNo=2&sid=2311&pageNo=3" id="next">c</a>

To avoid modifying the original URLs, you should pass the canonicalize=False argument to the SgmlLinkExtractor class. Also, you may want to use a less specific XPath rule, because with the current rule you don't get the items from the start URL.

Like this:

rules = [
    Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="pagination"]', 
                           canonicalize=False),
         callback='parser', follow=True),
]
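
One caveat not covered above: CrawlSpider never passes the start_urls responses to a rule callback. If you want the first page parsed the same way regardless of the pagination links, you can add CrawlSpider's standard parse_start_url hook to the spider (this assumes parser returns its items):

    def parse_start_url(self, response):
        # CrawlSpider calls this hook for start_urls responses, which the
        # rules above never route to the 'parser' callback
        return self.parser(response)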
Answered 2013-09-19T01:26:05.143

I posted on the Scrapy Google Groups page and got a reply! Here it is:

I think you may have found a bug.

When I fetch the first page in the scrapy shell, the SgmlLinkExtractor output goes wrong after the second page:

(py2.7)paul@wheezy:~/tmp/rmp$ scrapy shell http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311
...
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311', text=u'c', fragment='', nofollow=False)]
>>> fetch('http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311')
2013-09-19 02:05:38+0200 [rmpspider] DEBUG: Crawled (200) <GET http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311> (referer: None)
...
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&pageNo=3&sid=2311', text=u'c', fragment='', nofollow=False)]

But when I start the shell directly from page 2, the next page is fine; the next link extracted from page 3, however, is wrong again:

(py2.7)paul@wheezy:~/tmp/rmp$ scrapy shell "http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=2"
...
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=3&sid=2311', text=u'c', fragment='', nofollow=False)]
>>> fetch('http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=3&sid=2311')
...
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=3&pageNo=4&sid=2311', text=u'c', fragment='', nofollow=False)]

In the meantime, you can write an equivalent spider with BaseSpider and build the next-page requests "manually", using a little HtmlXPathSelector select() and urlparse.urljoin():

#from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from scrapy.spider import BaseSpider
#from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from rmp.items import RmpItem
import urlparse

class MySpider(BaseSpider):
    name = "rmpspider"
    allowed_domains = ["ratemyprofessors.com"]
    start_urls = ["http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311"]

    #rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)), callback='parser', follow=True),)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        html = hxs.select("//div[@class='entry odd vertical-center'] | //div[@class='entry even vertical-center']")
        for line in html:
            prof = RmpItem()
            prof["name"] = line.select("div[@class='profName']/a/text()").extract()
            yield prof

        # follow the "next" link exactly as it appears in the page,
        # without letting a link extractor canonicalize (reorder) it
        for url in hxs.select('//a[@id="next"]/@href').extract():
            yield Request(urlparse.urljoin(response.url, url))
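
To try it out, the usual crawl command with a feed export should do (the output filename here is just an example):

$ scrapy crawl rmpspider -o profs.json -t json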
Answered 2013-09-19T00:52:19.090