Below is a Scrapy spider I put together to extract some elements from a page. I borrowed the approach from another Stack Overflow answer. It works, but I need more: once a page has been validated, I need to be able to crawl the series of pages specified by the for loop inside the start_requests method one after another, in series.
Yes, I did find the Scrapy documentation covering this, as well as an earlier, very similar question; neither quite makes sense to me. As far as I can tell, I need to somehow create a Request object and keep passing it along, but I can't figure out how to do that.
Thanks in advance for your help.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import re


class MyBasicSpider(BaseSpider):
    name = "awBasic"
    allowed_domains = ["americanwhitewater.org"]

    def start_requests(self):
        '''
        Override BaseSpider.start_requests to crawl all reaches in series
        '''
        # for every integer from one to 5000
        for i in xrange(1, 50):  # 1 to 50 for testing
            # convert to string
            iStr = str(i)
            # add leading zeros to get to four digit length
            while len(iStr) < 4:
                iStr = '0{0}'.format(iStr)
            # call make requests
            yield self.make_requests_from_url('https://mycrawlsite.com/{0}/'.format(iStr))

    def parse(self, response):
        # create xpath selector object instance with response
        hxs = HtmlXPathSelector(response)
        # get the four-digit id out of the url string
        url = response.url
        id = re.findall(r'/(\d{4})/', url)[0]
        # selector 01
        attribute01 = hxs.select('//div[@id="block_1"]/text()').re('([^,]*)')[0]
        # selector for river section
        attribute02 = hxs.select('//div[@id="block_1"]/div[1]/text()').extract()[0]
        # print results
        print('\tID: {0}\n\tAttr01: {1}\n\tAttr02: {2}'.format(id, attribute01, attribute02))
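For what it's worth, the "create a Request object and keep passing it along" idea is just: each callback yields new request objects that name the next callback, carrying any state forward in a meta dict. Here is a framework-free sketch of that control flow (no Scrapy here; `Request`, the `crawl` driver, and the URLs are hypothetical stand-ins for `scrapy.Request` and the Scrapy engine, and in real Scrapy a callback receives a Response, not the Request itself):

```python
class Request(object):
    # Hypothetical stand-in for scrapy.Request: a URL, the callback that
    # should handle it, and a meta dict for carrying state between callbacks.
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

def start_requests():
    # Pad each id to four digits; str.zfill does what the while loop did.
    for i in range(1, 4):
        url = 'https://mycrawlsite.com/{0}/'.format(str(i).zfill(4))
        yield Request(url, callback=parse, meta={'id': i})

def parse(request):
    # First callback: after handling this page, hand off to a second
    # callback by yielding a new Request, passing state along via meta.
    yield Request(request.url + 'detail/', callback=parse_detail,
                  meta={'id': request.meta['id']})

def parse_detail(request):
    # Terminal callback: yields no further requests, so the chain ends.
    return None

def crawl(seeds):
    # Minimal stand-in for the crawler engine: drain the request queue,
    # call each request's callback, and enqueue whatever it yields.
    queue = list(seeds)
    visited = []
    while queue:
        req = queue.pop(0)
        visited.append(req.url)
        result = req.callback(req)
        if result is not None:
            queue.extend(result)
    return visited

urls = crawl(start_requests())
```

After `crawl` runs, `urls` holds each list page followed by the detail page its callback chained to (e.g. `https://mycrawlsite.com/0001/` and then, later, `https://mycrawlsite.com/0001/detail/`). In real Scrapy the same shape applies: yield a `scrapy.Request(url, callback=...)` from inside a parse method, and the engine schedules it for you.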