python - Re-login to a website in Scrapy to resume a Scrapy job

Question

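Is there a way to have a Scrapy spider log in to a website in order to resume a previously paused crawl job?

Edit: To clarify, my question is really about Scrapy spiders rather than cookies in general. Perhaps a better question is whether any method is called when a Scrapy spider is revived after being frozen in its job directory.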

score -1 · Accepted Answer

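Yes, you can.

You should be clearer about the exact workflow of your scraper.

In any case, I assume you log in during the first crawl and want to reuse the same cookie when you resume the crawl.

You can use the httplib2 library to do something like this. Here is a code sample from their examples page, with comments added for clarity.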

import urllib.parse
import httplib2

http = httplib2.Http()

url = 'http://www.example.com/login'
body = {'USERNAME': 'foo', 'PASSWORD': 'bar'}
headers = {'Content-type': 'application/x-www-form-urlencoded'}

# Submit the form data to log in to the website
response, content = http.request(url, 'POST', headers=headers, body=urllib.parse.urlencode(body))

# The 'response' object now contains the cookie the website sent,
# which can be used to visit the website again

# Set the cookie in the new headers
headers_2 = {'Cookie': response['set-cookie']}

url = 'http://www.example.com/home'

# Use 'headers_2' to visit the website
response, content = http.request(url, 'GET', headers=headers_2)
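
To reuse the cookie when a crawl is resumed, it has to survive between runs. The following is a minimal sketch of one way to do that, assuming the cookie string is simply written to a JSON file. The helper names `save_cookie` and `load_cookie`, the file name `example_session.json`, and the cookie value are hypothetical and not part of httplib2 or Scrapy.

```python
import json
import os
import tempfile


def save_cookie(path, cookie_header):
    """Write the Cookie header value to disk so a later run can reuse it."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"Cookie": cookie_header}, fh)


def load_cookie(path):
    """Return the saved Cookie header value, or None if nothing was saved."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as fh:
        return json.load(fh).get("Cookie")


# Hypothetical file name; any writable location works.
cookie_file = os.path.join(tempfile.gettempdir(), "example_session.json")

# First run: after logging in, store the cookie string
# (a made-up value stands in for response['set-cookie'] here).
save_cookie(cookie_file, "sessionid=abc123")

# Resumed run: reload the cookie; an empty dict means "log in again".
saved = load_cookie(cookie_file)
headers_2 = {"Cookie": saved} if saved else {}
```

If the server has expired the session in the meantime, the saved cookie will be rejected, so a resumed run should still be prepared to log in again.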

If you are unsure how cookies work, it is worth looking them up. In short, cookies are a client-side mechanism that helps the server maintain a session.
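
As a small illustration of that mechanism: the server sends a `Set-Cookie` header that carries a name/value pair plus attributes such as `Path`, and the client sends back only the name/value pair in its `Cookie` header. The sketch below uses the standard library's `http.cookies` module to do that conversion. The function name `cookie_header_from_set_cookie` and the sample header are made up for illustration, and the sketch assumes a single `Set-Cookie` header value.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_value):
    """Build a Cookie request header value from a Set-Cookie response value."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    # Keep only name=value; attributes like Path or HttpOnly are not sent back.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


# Hypothetical header as a server might send it after a successful login.
raw = "sessionid=abc123; Path=/; HttpOnly"
cookie_value = cookie_header_from_set_cookie(raw)
print(cookie_value)
```

Passing the raw `Set-Cookie` value straight through, as the httplib2 sample does, is accepted by many servers, but stripping the attributes first is closer to what a browser sends.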
