How can I get the request URL in Scrapy's parse() function? I have a lot of URLs in start_urls, and some of them redirect my spider to the homepage, so as a result I get an empty item. I need something like item['start_url'] = request.url to store these URLs. I'm using BaseSpider.
Viewed 52590 times
5 answers
93
The response variable passed to parse() has the information you want. You shouldn't need to override anything.
For example (edited):
    def parse(self, response):
        # response.request is the Request that produced this response
        print("URL: " + response.request.url)
answered 2015-01-25T07:50:23.353
17
The request object is accessible from the response object, so you can do the following:
    def parse(self, response):
        item = {}  # a plain dict here; your own Item class works the same way
        item['start_url'] = response.request.url
        return item
answered 2015-12-29T03:57:56.733
7
You need to override BaseSpider's make_requests_from_url(url) function to assign the start_url to the item, and then use the Request.meta special keys to pass that item to the parse function.
    from scrapy.http import Request

    # override method
    def make_requests_from_url(self, url):
        item = MyItem()
        # assign url
        item['start_url'] = url
        request = Request(url, dont_filter=True)
        # set the meta['item'] to use the item in the next call back
        request.meta['item'] = item
        return request

    def parse(self, response):
        # access and do something with the item in parse
        item = response.meta['item']
        item['other_url'] = response.url
        return item
Hope that helps.
answered 2013-11-19T22:06:03.120
7
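Rather than storing the requested URLs somewhere yourself (note that Scrapy does not necessarily process URLs in the same order as they appear in start_urls), you can read them from the request's meta.
Using
    response.request.meta['redirect_urls']
gives you the list of URLs the request was redirected through, such as ['http://requested_url', 'https://redirected_url', 'https://final_redirected_url'].
To access the first URL in that list, use
    response.request.meta['redirect_urls'][0]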
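For more information, see the RedirectMiddleware section on doc.scrapy.org, which says:
RedirectMiddleware
This middleware handles redirection of requests based on response status.
The urls which the request goes through (while being redirected) can be found in the redirect_urls Request.meta key.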
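One caveat with the lookup above: the redirect_urls key is only set when a redirect actually happened, so indexing it directly raises a KeyError for URLs that were fetched without one. Below is a minimal sketch of a fallback, assuming a hypothetical helper named original_url (not part of Scrapy) and using SimpleNamespace objects as stand-ins for Scrapy's response and request so the snippet runs on its own.

```python
from types import SimpleNamespace


def original_url(response):
    """Return the URL that was first requested, before any redirects.

    Hypothetical helper for illustration; it only relies on
    response.request.url and response.request.meta.
    """
    request = response.request
    # 'redirect_urls' is only present when at least one redirect happened,
    # so fall back to the request's own URL otherwise.
    redirect_urls = request.meta.get('redirect_urls')
    if redirect_urls:
        return redirect_urls[0]
    return request.url


# Stand-ins for Scrapy objects (assumed shapes, for demonstration only).
redirected = SimpleNamespace(request=SimpleNamespace(
    url='https://example.com/',
    meta={'redirect_urls': ['http://example.com/old-page']},
))
direct = SimpleNamespace(request=SimpleNamespace(
    url='https://example.com/page',
    meta={},
))

print(original_url(redirected))  # http://example.com/old-page
print(original_url(direct))      # https://example.com/page
```

Inside a spider you would call it as item['start_url'] = original_url(response) in parse().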
Hope this helps.
answered 2017-12-13T12:17:30.023
3
Python 3.5
Scrapy 1.5.0
    from scrapy.http import Request

    # override method
    def start_requests(self):
        for url in self.start_urls:
            item = {'start_url': url}
            request = Request(url, dont_filter=True)
            # set the meta['item'] to use the item in the next call back
            request.meta['item'] = item
            yield request

    # use meta variable
    def parse(self, response):
        url = response.meta['item']['start_url']
answered 2018-04-17T08:07:25.707