python - Scraping a website that requires login

Question

I'm trying to scrape a website with BeautifulSoup. The website in question requires me to log in. Please take a look at my code.

from bs4 import BeautifulSoup as bs
import requests
import sys

user = 'user'
password = 'pass'

# Url to login page
url = 'main url'

# Starts a session
session = requests.session(config={'verbose': sys.stderr})

login_data = {
    'loginuser': user,
    'loginpswd': password,
    'submit': 'login',
}

r = session.post(url, data=login_data)

# Accessing a page to scrape
r = session.get('specific url')
soup = bs(r.content)

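I came up with this code after reading a few threads here, so I assumed it would work, but the content that gets printed still looks as if I were logged out.

When I run this code, it prints: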

2013-05-10T22:49:45.882000   POST   >the main url to login<
2013-05-10T22:49:46.676000   GET    >error page of the main url page as if the logging in failed<
2013-05-10T22:49:46.761000   GET    >the specific url<

The login details are correct, of course. I could use some help, guys.

Edit

How would I implement headers in the code above?

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

score 3 · Accepted Answer

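First, you should not be using any version of requests older than 1.2.0. If you find bugs (and you probably will), we simply won't support them.

Second, what you are probably looking for is: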

import sys

import requests
from requests.packages.urllib3 import add_stderr_logger

add_stderr_logger()
s = requests.Session()

s.headers['User-Agent'] = 'Mozilla/5.0'

# after examining the HTML of the website you're trying to log into
# set name_form to the name of the form element that contains the name and
# set password_form to the name of the form element that will contain the password
login = {name_form: username, password_form: password}
login_response = s.post(url, data=login)
for r in login_response.history:
    if r.status_code == 401:  # 401 means authentication failed
        sys.exit(1)  # abort

pdf_response = s.get(pdf_url)  # Your cookies and headers are automatically included

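I commented the code to help you. You can also try @FastTurtle's suggestion of using HTTP Basic Auth, but if you were trying to post to a form in the first place, you can keep doing it the way I described above. Also make sure that loginuser and loginpswd are the correct form element names. If they are not, that may be the underlying problem here.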
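One way to check the form element names is to list every named input in the login form before posting to it. The following is only a minimal sketch: the markup in SAMPLE_LOGIN_PAGE is made up for illustration (including the hidden "token" field), and the helper names are hypothetical. It uses the standard library parser so it runs without extra dependencies; BeautifulSoup would work just as well.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs of every named <input> found inside a <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form and "name" in attrs:
            # Inputs without a value attribute are recorded as empty strings.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def form_fields(html):
    """Return a dict of input names to default values for the given HTML."""
    collector = FormFieldCollector()
    collector.feed(html)
    return collector.fields


# Hypothetical login page markup, not taken from the site in the question.
SAMPLE_LOGIN_PAGE = """
<form action="/login" method="post">
  <input type="text" name="loginuser">
  <input type="password" name="loginpswd">
  <input type="hidden" name="token" value="abc123">
  <input type="submit" name="submit" value="login">
</form>
"""

fields = form_fields(SAMPLE_LOGIN_PAGE)
print(fields)
```

In practice the HTML would come from a GET of the login page on the same session, and any hidden fields found this way would likely need to be sent along with the username and password.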

score 1 · Accepted Answer

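The requests module supports several types of authentication. With any luck, the website you are trying to parse uses HTTP Basic Auth, in which case sending the credentials is very easy.

This example is taken from the requests website, where you can read more about authentication and about sending headers with requests.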

import requests

s = requests.Session()
s.auth = ('user', 'pass')
s.headers.update({'x-test': 'true'})

# both 'x-test' and 'x-test2' are sent
s.get('http://httpbin.org/headers', headers={'x-test2': 'true'})
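To make clear what HTTP Basic Auth puts on the wire, here is a small standard-library sketch of the Authorization header that a ('user', 'pass') credential pair corresponds to. The function name basic_auth_header is hypothetical; in real code, setting s.auth as above should be enough and there is no need to build the header by hand.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header value used by HTTP Basic Auth."""
    # Basic Auth is the base64 encoding of "username:password".
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


header = basic_auth_header("user", "pass")
print(header)
```

Because the credentials are only encoded and not encrypted, this scheme is normally used over HTTPS. If the site shows an HTML login form instead of a browser credential prompt, it most likely expects a form POST and not Basic Auth.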
