python - Fetching a page with cookies using the Python requests library

Question

I am just getting started with the requests library (http://docs.python-requests.org/en/latest/) and ran into a problem fetching a page with cookies using requests.

For example:

import requests

url2 = 'https://passport.baidu.com'
# Cookie values are replaced by ... for privacy
parsedCookies = {'PTOKEN': '412f...', 'BDUSS': 'hnN2...', ...}
req = requests.get(url2, cookies=parsedCookies)
text = req.text.encode('utf-8', 'ignore')
f = open('before.html', 'wb')  # binary mode, since text is an encoded byte string
f.write(text)
f.close()
req.close()

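When I use the code above to fetch the page, it saves the login page to 'before.html' instead of the logged-in page, which means I have not actually logged in successfully.

But if I use urllib2 to fetch the page, it works as expected.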

import urllib2

# Different format, but the same content as the cookies above
parsedCookies = "PTOKEN=412f...;BDUSS=hnN2...;..."
req = urllib2.Request(url2)
req.add_header('Cookie', parsedCookies)
ret = urllib2.urlopen(req)
f = open('before_urllib2.html', 'wb')
f.write(ret.read())
f.close()
ret.close()

When I use this code, it saves the logged-in page to before_urllib2.html.
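The two snippets pass the same cookies in different shapes: a dict for requests and a ready-made Cookie header string for urllib2. As a small sketch (Python 3, with a hypothetical helper name and placeholder values, since the real cookie values are elided above), the dict form can be turned into the header form like this:

```python
def cookie_header_from_dict(cookies):
    """Join name/value pairs into a single Cookie header value."""
    return "; ".join("%s=%s" % (name, value) for name, value in cookies.items())


# Placeholder values for illustration only, not the real cookies.
demo_cookies = {"PTOKEN": "token-value", "BDUSS": "session-value"}
print(cookie_header_from_dict(demo_cookies))
```

This only shows that both formats describe the same data; it does not by itself explain why the two libraries behave differently.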

--

Is there a mistake in my code? Any reply would be appreciated.

score 2 · Accepted Answer

You can use a Session object to get what you want:

url2 = 'http://passport.baidu.com'
session = requests.Session()  # create a Session object
cookie = requests.utils.cookiejar_from_dict(parsedCookies)
session.cookies.update(cookie)  # set the cookies of the Session object

# headers is assumed to be a dict of request headers defined elsewhere
req = session.get(url2, headers=headers, allow_redirects=True)

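If you use the requests.get function, it does not send the cookies for the redirected page. If you use Session().get instead, it maintains and sends the cookies for all HTTP requests, which is exactly what the concept of a "session" means.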
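One way to check this without touching the network is to ask the session which Cookie header it would attach to a request. The following is a minimal sketch (Python 3), assuming placeholder cookie values and a hypothetical helper name, cookie_header_for; it prepares requests for both the original URL and the redirect target but never sends them:

```python
import requests

# Placeholder values for illustration only; the real cookies are elided in the question.
demo_cookies = {"PTOKEN": "token-value", "BDUSS": "session-value"}

session = requests.Session()
session.cookies.update(requests.utils.cookiejar_from_dict(demo_cookies))


def cookie_header_for(url):
    """Return the Cookie header the session would attach to a GET for url."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return prepared.headers.get("Cookie", "")


# Nothing is sent here; the requests are only prepared and inspected.
first = cookie_header_for("http://passport.baidu.com/center")
second = cookie_header_for("http://passport.baidu.com/center?_t=1380462657")
print(first)
print(second)
```

Both prepared requests should carry the same cookies, because they live on the session rather than on a single call.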

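Let me try to explain in more detail what happens here:

When I send the cookies to http://passport.baidu.com/center with the parameter allow_redirects set to False, the returned status code is 302 and one of the response headers is 'location': '/center?_t=1380462657' (this is a dynamic value generated by the server; replace it with whatever you get from the server):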

url2 = 'http://passport.baidu.com/center'
req = requests.get(url2, cookies=parsedCookies, allow_redirects=False)
print(req.status_code)  # output: 302
print(req.headers)

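But when I set allow_redirects to True, it still does not end up at the redirected page (http://passport.baidu.com/center?_t=1380462657), and the server returns the login page. The reason is that requests.get does not send the cookies for the redirected page, here http://passport.baidu.com/center?_t=1380462657, so we cannot log in successfully. That is why we need the Session object.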
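To make the redirect flow concrete, here is a self-contained sketch (Python 3, standard library only) that runs a local stand-in server: it answers the first request with a 302 and then decides between an "account page" and a "login page" based on whether the cookie arrives on the second request. The handler, the helper name fetch_with_cookie_header, and the cookie value are all hypothetical; the real site's logic is not known. The client is urllib.request, which as far as I can tell copies explicitly added headers onto the redirected request, consistent with the urllib2 result in the question.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the site: redirect first, then check the cookie."""

    def do_GET(self):
        if "_t=" not in self.path:
            # First hit: bounce the client to a URL carrying a token.
            self.send_response(302)
            self.send_header("Location", "/center?_t=1")
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        # Redirect target: the page depends on whether the cookie arrived.
        has_cookie = "BDUSS=" in (self.headers.get("Cookie") or "")
        body = b"account page" if has_cookie else b"login page"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_with_cookie_header(cookie_header):
    """Fetch /center from the local demo server, following the redirect."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/center" % server.server_port
        request = urllib.request.Request(url)
        if cookie_header:
            request.add_header("Cookie", cookie_header)
        # An empty ProxyHandler keeps the local request away from any system proxy.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request) as response:
            return response.read().decode("ascii")
    finally:
        server.shutdown()
        server.server_close()


print(fetch_with_cookie_header("BDUSS=session-value"))
print(fetch_with_cookie_header(""))
```

The second call shows what a client sees when the cookie is missing on the redirected request, which matches the symptom described above.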

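If I set url2 = http://passport.baidu.com/center?_t=1380462657, it returns the page you want. So one solution is to use the code above to get the dynamic location value, build the URL for your account from it (for example http://passport.baidu.com/center?_t=1380462657), and then request that URL to get the desired page.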

url2 = 'http://passport.baidu.com' + req.headers.get('location')
req = session.get(url2, cookies=parsedCookies, allow_redirects=True)
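The string concatenation above works when the Location header is a path such as '/center?_t=1380462657'. A slightly more defensive sketch (Python 3, hypothetical helper name) uses urljoin from the standard library, which also copes with a Location header that is already an absolute URL:

```python
from urllib.parse import urljoin


def redirect_target(request_url, location):
    """Resolve a Location header value against the URL that was requested."""
    return urljoin(request_url, location)


print(redirect_target("http://passport.baidu.com/center", "/center?_t=1380462657"))
```

The resolved URL can then be passed to a second request together with the cookies, as in the snippet above.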

But this is cumbersome, so when dealing with cookies, the Session object does the job for us nicely!
