python - Scraping a website with a login page

Question

I am currently using the following script to log in to a website.

import time

from selenium import webdriver

# Start Chrome through the local chromedriver executable
browser = webdriver.Chrome('E:/Shared Folders/Users/runnerjp/chromedriver/chromedriver.exe')
browser.get("https://www.timeform.com/horse-racing/account/sign-in?returnUrl=%2Fhorse-racing%2F")
time.sleep(3)  # crude wait for the page to load
username = browser.find_element_by_id("EmailAddress")
password = browser.find_element_by_id("Password")
username.send_keys("usr")
password.send_keys("pass")
login_attempt = browser.find_element_by_xpath("//input[@type='submit']")
time.sleep(3)
login_attempt.submit()

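It works, but I have found that using the Chrome web driver is hammering my CPU. Is there alternative code I could use that does not require physically loading the page in order to log in?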

score 5 · Accepted Answer

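All the answers here have some merit, but the right choice depends on the type of website being scraped and how it authenticates the login.
If the page generates some or all of its content through JavaScript/AJAX requests, then using Selenium is the only way, because it allows the JavaScript to be executed. To keep CPU usage to a minimum, however, you can use a "headless" browser such as PhantomJS. PhantomJS is built on WebKit, the engine Chrome's own engine was forked from, so you can usually test your code with Chrome and switch over at the end.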

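If the page content is "static", you can use the requests module instead. How you do this depends on whether the site uses "basic" authentication, which is built into the HTTP protocol (most sites do not use it). If it does: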

import requests
requests.get('https://api.github.com/user', auth=('user', 'pass'))

as suggested by CodeMonkey.
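
For context, HTTP basic authentication simply sends the credentials in an Authorization header on every request. Below is a minimal standard-library sketch of what the auth=('user', 'pass') argument amounts to. The helper name basic_auth_header is made up for illustration, and the request is only built, never sent.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Return the value of the Authorization header for HTTP basic auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Build (but do not send) a request carrying the header.
request = urllib.request.Request(
    "https://api.github.com/user",
    headers={"Authorization": basic_auth_header("user", "pass")},
)
print(request.get_header("Authorization"))
```

If a site answers a plain request with a login form rather than a 401 challenge, it is very likely not using basic authentication, and this header will not help.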

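But if it uses something else, you will have to analyse the login form to see which address the POST request is sent to, then build a request to that address with the username and password placed in fields that match the form's input elements (strictly their name attributes, which often equal the element ids).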
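
As a rough illustration of that analysis, the sketch below uses only the standard library to pull the form's action URL and input names out of a login page. The HTML snippet, the /account/sign-in path, the hidden token field, the example.com address and the LoginFormParser class are all invented for the example; a real page will differ.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# Hypothetical login page markup; a real site will look different.
SAMPLE_HTML = """
<form action="/account/sign-in" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="EmailAddress" id="EmailAddress">
  <input type="password" name="Password" id="Password">
  <input type="submit" value="Sign in">
</form>
"""


class LoginFormParser(HTMLParser):
    """Collect the first form's action, method and named inputs."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action", "")
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Keep pre-filled values such as hidden anti-forgery tokens.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


parser = LoginFormParser()
parser.feed(SAMPLE_HTML)

# Fill in the credentials and work out where the POST should go.
fields = dict(parser.fields, EmailAddress="usr", Password="pass")
post_url = urljoin("https://www.example.com/login", parser.action)
post_body = urlencode(fields)

print(parser.method, post_url)
print(post_body)
```

The resulting URL and fields are what you would hand to requests.post(post_url, data=fields), ideally through a requests.Session so that the login cookie is kept for later requests.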

score 2 · Accepted Answer

Use requests instead. You can use it to log in:

import requests
requests.get('https://api.github.com/user', auth=('user', 'pass'))

More info here: http://docs.python-requests.org/en/master/user/authentication/

score 2 · Accepted Answer

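You can use TestCafe.

TestCafe is a free, open-source framework for web functional testing (e2e testing). TestCafe is based on Node.js and does not use WebDriver at all.

TestCafe-powered tests are executed on the server side. To obtain DOM elements, TestCafe provides a powerful and flexible Selector system. TestCafe can execute JavaScript on the tested web page using the ClientFunction feature (see our documentation).

TestCafe tests are really fast; see for yourself. Thanks to the built-in smart wait system, the high speed of test runs does not affect stability.

Installing TestCafe is very easy:

1) Check that you have Node.js on your PC (or install it).

2) To install TestCafe, open cmd and type: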

npm install -g testcafe

Writing tests is not rocket science. Here is a quick start: 1) Copy and paste the following code into your text editor and save it as "test.js"

import { Selector } from 'testcafe';

fixture `Getting Started`
    .page `http://devexpress.github.io/testcafe/example`;

test('My first test', async t => {
    await t
        .typeText('#developer-name', 'John Smith')
        .click('#submit-button')
        .expect(Selector('#article-header').innerText).eql('Thank you, John Smith!');
});

2) Run the test in a browser (e.g. Chrome) by typing the following command in cmd:

testcafe chrome test.js

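3) Get descriptive results in the console output.

TestCafe allows you to test against various browsers: local, remote (on devices, be it a browser for Raspberry Pi or Safari for iOS), cloud (e.g. Sauce Labs) or headless (e.g. Nightmare). This means you can easily use TestCafe with your continuous integration infrastructure.

You can use the same approach to scrape data and save it to a file easily.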

score 0 · Accepted Answer

You can use mechanize; on my old laptop it took 3.22 seconds to log in and parse the website.

from mechanize import Browser
import time  # only used to measure elapsed time

started_time = time.time()

browser = Browser()
url = 'https://www.timeform.com/horse-racing/account/sign-in?returnUrl=%2Fhorse-racing%2F'
browser.open(url)
browser.select_form(nr=0)  # select the first form on the page
browser["EmailAddress"] = 'putyouremailhere'
browser["Password"] = 'p4ssw0rd'

logged = browser.submit()
page_html = logged.read()  # body of the page returned after submitting the form
print(page_html)

# you can delete this section:
elapsed_time = time.time() - started_time
print(elapsed_time, 'seconds')

I hope it helps! :)

score 0 · Accepted Answer

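You can go about this in a couple of ways:

1) If you really need full Selenium functionality (JavaScript etc.), try a headless browser driver (e.g. ghostdriver). You will not save as much CPU time as with the second way (below), though.
2) Instead of the rather heavy Selenium, you can use a lightweight tool such as robobrowser (py3), mechanize or browserplus. This saves a lot of CPU time, but these tools do not support JavaScript and lack some of the advanced features Selenium offers.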

score 0 · Accepted Answer

I recommend https://scrapy.org/. It uses Twisted under the hood, so it is very efficient.

If you need to execute JavaScript, there is also the scrapy-splash package: https://github.com/scrapy-plugins/scrapy-splash

The Scrapy documentation has a dedicated section on login pages: https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-userlogin

score 0 · Accepted Answer

Using a headless browser will significantly reduce CPU and memory consumption; try using PhantomJS instead of Chrome. Here is a nice blog post about using PhantomJS with Selenium:

https://realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs/

score 0 · Accepted Answer

Another option is the grab module:

from grab import Grab

g = Grab()
g.go('https://www.timeform.com/horse-racing/account/sign-in?returnUrl=%2Fhorse-racing%2F')
g.doc.set_input('EmailAddress','some@email.com')
g.doc.set_input('Password','somepass')
g.doc.submit()

print(g.doc.body)

score 0 · Accepted Answer

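Yes. Instead of Selenium or Chromium, or I should say instead of any headless browser, you should work at the HTTP level (calling the URL directly).

The requests and urllib modules will help here.

To do that, you need to identify the parameters and the method type. Once you have identified what is needed to call the URL, you can use requests or urllib. You also need to keep track of the kind of response you get or expect to get.

Here is good documentation for Requests.

Example using requests:

Case: here we submit a form that has two fields, id and pwd. The method specified in the form is POST, and the names specified in the form are user_id and user_pwd for id and pwd respectively. When the button is clicked, it calls "some url".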

import requests

url = 'some url'  # placeholder: the address the login form posts to
data_to_send = {'user_id': 'id you want to pass', 'user_pwd': 'specify pwd here'}
# Specify headers and cookies here only if the site requires them;
# requests sets the form content-type itself when data is a dict.
response = requests.post(url, data=data_to_send, headers={'user-agent': 'chrome...'})

if response.status_code == 200:
    content_received = response.text
    # Inspect the received content; it is often JSON, in which case decode it with response.json().
    if content_received == 'Response is same that you have expected':
        print("Successfully")
    else:
        print("Failed")
else:
    print("Failed")
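
Since urllib is mentioned but not shown, here is a rough standard-library equivalent of the same POST. The example.com URL and the field values are placeholders, build_login_request is a name invented for this sketch, and the request is only built, not sent. The cookie-aware opener is there because a login usually hands back a session cookie that later requests must carry.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_login_request(url, fields, user_agent="Mozilla/5.0"):
    """Build a form-encoded POST request for the given login fields."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"User-Agent": user_agent},
        method="POST",
    )


# An opener that stores cookies from responses and sends them back later.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)

request = build_login_request(
    "https://www.example.com/login",  # placeholder URL
    {"user_id": "id you want to pass", "user_pwd": "specify pwd here"},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually log in, uncomment the lines below and reuse `opener`
# for the pages behind the login:
# with opener.open(request) as response:
#     print(response.status)
```

Reusing the same opener for every later request plays the role that a Session object plays in requests.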

See my other answer on how to use requests, cookies and Selenium.
