52

How can I get the request URL in Scrapy's parse() function? I have a lot of URLs in start_urls, and some of them redirect my spider to the homepage, so I end up with an empty item. I need something like item['start_url'] = request.url to store these URLs. I'm using BaseSpider.

4

5 Answers

93

The response variable passed to parse() has the information you want. You don't need to override anything.

For example (edited):

def parse(self, response):
    # response.request is the Request that produced this response
    print("URL: " + response.request.url)
Answered 2015-01-25T07:50:23.353
17

The request object is accessible from the response object, so you can do the following:

def parse(self, response):
    item = {}  # or an instance of your own Item class
    item['start_url'] = response.request.url
    yield item
Answered 2015-12-29T03:57:56.733
7

You need to override BaseSpider's make_requests_from_url(url) function to assign the start URL to the item, and then use the Request.meta special key to pass that item to the parse function.

from scrapy.http import Request

    # override method
    def make_requests_from_url(self, url):
        item = MyItem()

        # assign url
        item['start_url'] = url
        request = Request(url, dont_filter=True)

        # set the meta['item'] to use the item in the next call back
        request.meta['item'] = item
        return request


    def parse(self, response):

        # access and do something with the item in parse
        item = response.meta['item']
        item['other_url'] = response.url
        return item

Hope that helps.

Answered 2013-11-19T22:06:03.120
7

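You don't need to store the requested URLs somewhere yourself, and you can't rely on ordering either, because Scrapy does not necessarily process URLs in the same order as they appear in start_urls.

Using the following,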

response.request.meta['redirect_urls']

will give you the list of URLs the request was redirected through, such as ['http://requested_url','https://redirected_url','https://final_redirected_url']

To access the first URL in that list, you can use

response.request.meta['redirect_urls'][0]

For more information, see doc.scrapy.org, which mentions:

RedirectMiddleware

This middleware handles redirection of requests based on response status.

The URLs which the request goes through (while being redirected) can be found in the redirect_urls Request.meta key.

Hope this helps you.

Answered 2017-12-13T12:17:30.023
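
One caveat with the lookup above: the redirect_urls key is only populated when a redirect actually occurred, so indexing it directly can raise a KeyError for URLs that were fetched without redirection. Below is a minimal sketch of a fallback. The helper name original_url is hypothetical, and the stand-in objects only mimic the url and meta attributes of a Scrapy response so the snippet runs without a crawl.

```python
from types import SimpleNamespace


def original_url(response):
    """Return the first URL requested for this response, before any redirects."""
    # 'redirect_urls' is only present when a redirect happened,
    # so fall back to the response's own URL otherwise.
    redirect_urls = response.meta.get('redirect_urls')
    return redirect_urls[0] if redirect_urls else response.url


# Stand-in objects (assumptions for illustration, not real Scrapy responses).
redirected = SimpleNamespace(
    url='https://final_redirected_url',
    meta={'redirect_urls': ['http://requested_url', 'https://redirected_url']},
)
direct = SimpleNamespace(url='http://requested_url', meta={})

print(original_url(redirected))
print(original_url(direct))
```

Inside a spider, the same helper could be called as item['start_url'] = original_url(response), since response.meta is a shortcut for response.request.meta.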
3

Python 3.5

Scrapy 1.5.0

from scrapy.http import Request

# override method
def start_requests(self):
    for url in self.start_urls:
        item = {'start_url': url}
        request = Request(url, dont_filter=True)
        # set the meta['item'] to use the item in the next call back
        request.meta['item'] = item
        yield request

# use meta variable
def parse(self, response):
    url = response.meta['item']['start_url']
Answered 2018-04-17T08:07:25.707