I am trying to scrape the results from the following page:
http://www.peekyou.com/work/autodesk/page=1
where page = 1, 2, 3, 4 ... and so on, depending on the results. So I have a PHP file that launches the crawler for each page number. The code (for a single page) is below:
```python
import sys

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item
from scrapy.http import Request
#from scrapy.crawler import CrawlerProcess


class DmozSpider(BaseSpider):
    name = "peekyou_crawler"
    start_urls = ["http://www.peekyou.com/work/autodesk/page=1"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # Locate the "Next" pagination link
        discovery = hxs.select('//div[@class="nextPage"]/table/tr[2]/td/a[contains(@title,"Next")]')
        print len(discovery)

        print "Starting the actual file"
        items = hxs.select('//div[@class="resultCell"]')
        count = 0
        for newsItem in items:
            print newsItem
            url = newsItem.select('h2/a/@href').extract()
            name = newsItem.select('h2/a/span/text()').extract()
            count = count + 1
            print count
            print url[0]
            print name[0]
            print "\n"
```

The Autodesk results span 18 pages. When I run the code to crawl all the pages, the crawler only picks up data from page 2 rather than from all of them. Likewise, when I change the company name to a different one, it scrapes some pages and skips the rest. I get an HTTP 200 response for every page. Moreover, even when I run it again, it keeps scraping the same pages, though not consistently. Any ideas on what might be wrong in my approach, or what I am missing?
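For reference, instead of driving the crawler once per page from an external PHP script, one alternative is to pre-build every page URL inside the spider's `start_urls`. This is only a sketch: the base URL comes from the question, and the page count of 18 is the Autodesk-specific figure mentioned above, which would need adjusting (or discovering at runtime) for other companies.

```python
# Sketch: build all results-page URLs up front instead of launching the
# crawler separately for each page. The URL pattern is taken from the
# question; the page count (18) is specific to the Autodesk results.
BASE_URL = "http://www.peekyou.com/work/autodesk/page=%d"

def build_start_urls(num_pages):
    """Return one results-page URL per page, numbered from 1."""
    return [BASE_URL % page for page in range(1, num_pages + 1)]

# Assigning this list to the spider's start_urls would make Scrapy
# schedule every page in a single run.
start_urls = build_start_urls(18)
```

With this in place, a single crawl covers all pages, and any page that still comes back empty points to a parsing or anti-scraping issue rather than to the per-page launching scheme.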
Thanks in advance.