By default you cannot access the original start url. But you can override the make_requests_from_url method and put the start url into the request's meta. Then in parse you can extract it from there (and if you yield follow-up requests in that parse method, don't forget to forward the start url in them).
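A minimal sketch of just that meta trick, using a plain BaseSpider (the spider name and the follow-up url are made up for illustration), in case the CrawlSpider machinery below is more than you need:

from scrapy.spider import BaseSpider
from scrapy.http import Request

class FollowSpider(BaseSpider):
    name = 'follow_example'
    start_urls = ['http://www.example.com']

    def make_requests_from_url(self, url):
        # Tag every initial request with the url it was created from.
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse(self, response):
        # The tag survives redirects, unlike response.url.
        self.log('%s was reached from start url %s'
                 % (response.url, response.meta['start_url']))
        # Forward the tag when yielding follow-up requests, or it is lost.
        yield Request('http://www.example.com/next',
                      meta={'start_url': response.meta['start_url']})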
I haven't worked with CrawlSpider, so maybe Maxim's suggestion will work for you, but keep in mind that response.url contains the url after possible redirects.
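If all you need is the url before the redirect (rather than the original start url), the default RedirectMiddleware keeps the redirect chain in the request meta, so something like this should work (a sketch, not tested):

def parse(self, response):
    # RedirectMiddleware stores the redirect chain under 'redirect_urls';
    # its first entry is the url before any redirect. Fall back to
    # response.url when no redirect happened.
    url_before_redirects = response.meta.get('redirect_urls', [response.url])[0]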
Here is an example of how I would do it, but it's just an example (taken from the scrapy tutorial) and wasn't tested:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class MyItem(Item):
    # A bare Item() has no fields; declare the ones parse_item fills in.
    id = Field()
    name = Field()
    description = Field()
    start_url = Field()

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not 'subsection.php')
        # and follow links from them (no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',))),
        # Extract links matching 'item.php' and parse them with parse_item.
        Rule(SgmlLinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
    )

    # Normally you should avoid using parse as a callback, since CrawlSpider
    # uses parse to implement its own logic. Here it is overridden on purpose:
    # it delegates to CrawlSpider.parse and only forwards 'start_url' into
    # every request the rules generate.
    def parse(self, response):
        for request_or_item in CrawlSpider.parse(self, response):
            if isinstance(request_or_item, Request):
                # Merge into the existing meta instead of overwriting it,
                # so CrawlSpider's own keys (e.g. 'rule') survive.
                request_or_item = request_or_item.replace(
                    meta=dict(request_or_item.meta,
                              start_url=response.meta['start_url']))
            yield request_or_item

    def make_requests_from_url(self, url):
        """Receives a URL and returns a Request (or a list of Requests)
        to scrape. Used by start_requests() to build the initial requests;
        here it also stores the start url in the request meta.
        """
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = MyItem()
        item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
        item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
        item['start_url'] = response.meta['start_url']
        return item
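Assuming this lives in a regular Scrapy project, you run it as usual:

scrapy crawl example.com

and every returned item carries the original start url in item['start_url'], no matter how many rule-generated hops lie in between.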
Ask if you have any questions. BTW, with PyDev's "Go to definition" feature you can look at the scrapy sources and see what arguments Request, make_requests_from_url, and other classes and methods expect. Getting into the code helps and saves you time, even though it may seem difficult at first.