python - How to get the original start_url in Scrapy (before redirect)

Question

I am using Scrapy to crawl some pages. I read the start_urls from an Excel sheet, and I need to save each URL in the item.

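class abc_Spider(BaseSpider):
   name = 'abc'
   allowed_domains = ['abc.com']
   # Read the start URLs from one column of the Excel sheet
   wb = xlrd.open_workbook(path + '/somefile.xlsx')
   wb.sheet_names()
   sh = wb.sheet_by_name(u'Sheet1')
   first_column = sh.col_values(15)
   start_urls = first_column
   # Let 404 responses reach parse() instead of being filtered out
   handle_httpstatus_list = [404]

   def parse(self, response):
      item = abcspiderItem()
      # After a redirect this is the final URL, not the one from the sheet
      item['url'] = response.url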

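The problem is that the URL gets redirected to another URL, so response.url contains the redirect target instead. How can I get the original URL that I read from the Excel sheet?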

score 24 · Accepted Answer

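You can find what you need in response.request.meta['redirect_urls'].

Quoting the documentation:

The urls which the request goes through (while being redirected) can be found in the redirect_urls Request.meta key.

Hope that helps.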
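As a minimal sketch of how this could be used in the spider's parse method: the helper name original_url below is hypothetical (not part of Scrapy), and it assumes the redirect was followed by Scrapy's redirect middleware, which is what populates redirect_urls. The key is missing when no redirect took place, so the helper falls back to response.url in that case.

```python
def original_url(response):
    """Return the URL that was first requested, before any redirects.

    Hypothetical helper: expects an object shaped like a Scrapy response,
    i.e. with a .url attribute and a .request.meta dict.
    """
    # redirect_urls lists the URLs the request passed through, oldest first.
    # It is only present if at least one redirect was followed.
    redirect_urls = response.request.meta.get('redirect_urls')
    if redirect_urls:
        return redirect_urls[0]
    # No redirect happened, so the response URL is the requested URL.
    return response.url
```

In the question's spider, the assignment would then read item['url'] = original_url(response).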

score 1 · Accepted Answer

This gave me the original "referer URL", i.e. which of my start_urls led to the URL of this request object being crawled:

req = response.request
# Header values are stored as bytes, so decode the Referer value
referer_url = req.headers['Referer'].decode('utf-8')
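One caveat with this approach is that a request may carry no Referer header at all, in which case the lookup above raises an error. A defensive variant might look like the sketch below; the helper name referer_of is hypothetical, and it only assumes that request.headers supports a dict-style get and returns bytes values.

```python
def referer_of(response):
    """Return the Referer of the request as text, or None if it has none.

    Hypothetical helper: expects .request.headers to offer a dict-style get().
    """
    value = response.request.headers.get('Referer')
    if value is None:
        # No Referer header was sent with this request.
        return None
    # Header values are normally bytes; decode them, but accept str as well.
    return value.decode('utf-8') if isinstance(value, bytes) else value
```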
