r - Automatically log in to the UK Data Service website in R with RCurl or httr

Question

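I am writing a set of freely downloadable R scripts for http://asdfree.com/ to help people analyze the complex sample survey data hosted by the UK Data Service. In addition to providing lots of statistics tutorials for these data sets, I also want to automate the download and import of the survey data. To do that, I need to figure out how to log in to the UK Data Service website programmatically.

I have tried many different RCurl and httr configurations to log in, but I am making a mistake somewhere and I am stuck. I have tried inspecting the elements as outlined in this post, but the website jumps around too fast in the browser for me to understand what is going on.

The website does require a login and password, but I believe I am making a mistake before I even get to the login page.

Here is how the website works:

The starting page should be: https://www.esds.ac.uk/secure/UKDSRegister_start.asp

This page automatically redirects your web browser to a long URL that starts with: https://wayf.ukfederation.org.uk/DS002/uk.ds?[blahblahblah]

(1) For some reason, the SSL certificate does not work on this website. Here is the SO question I posted about this. The workaround I have used is simply to ignore SSL: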

library(httr)
set_config( config( ssl.verifypeer = 0L ) )

Then my first command on the starting website is:

z <- GET( "https://www.esds.ac.uk/secure/UKDSRegister_start.asp" )

This gives me back a z$url that looks very similar to the https://wayf.ukfederation.org.uk/DS002/uk.ds?[blahblahblah] page that my browser also redirects to.
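The set_config() call in (1) turns off certificate verification for every later request in the session. A narrower alternative is sketched here: pass the same option to a single request. It assumes a recent, curl-based httr in which the option is spelled ssl_verifypeer; the helper name get_unverified is hypothetical and not part of httr.

```r
library(httr)

# Assumption: a recent (curl-based) httr, where the option name uses an
# underscore (ssl_verifypeer) rather than a dot.
no_verify <- config(ssl_verifypeer = 0L)

# Hypothetical helper: issue one GET with certificate verification disabled,
# leaving the session-wide httr defaults untouched.
get_unverified <- function(url, ...) {
  GET(url, config = no_verify, ...)
}

# Usage (needs network access, so it is left commented out):
# z <- get_unverified("https://www.esds.ac.uk/secure/UKDSRegister_start.asp")
# z$url
```

Skipping verification is still only a workaround; scoping it to the requests that need it limits how far it reaches.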

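In the browser, you are then supposed to type in "uk data archive" and click the continue button. When I do that, it redirects me to the web page https://shib.data-archive.ac.uk/idp/Authn/UserPassword

I think this is where I am stuck, because I cannot figure out how to get cURL to followlocation and land on this page. Note: no username/password has been entered yet.

When I use the httr GET command from the wayf.ukfederation.org.uk page like this: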

 y <- GET( z$url , query = list( combobox = "https://shib.data-archive.ac.uk/shibboleth-idp" ) )

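The y$url string looks a lot like z$url (except that it has a combobox= at the end). Is there any way to get through to this uk data archive authentication page with RCurl or httr?

I cannot tell whether I am just overlooking something, or whether I absolutely must use the SSL certificate described in my previous SO post, or something else.

(2) Once I do get through to that page, I believe the remaining code would just be: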

values <- list( j_username = "your.username" , 
                j_password = "your.password" )
POST( "https://shib.data-archive.ac.uk/idp/Authn/UserPassword" , body = values)

But I guess that page will have to wait...

score 2 · Accepted Answer

The relevant data variables returned by the form are action and origin, not combobox. Give action the value selection, and give origin the value from the relevant entry in combobox:

y <- GET( z$url, query = list( action="selection", origin = "https://shib.data-archive.ac.uk/shibboleth-idp") )
> y$url
[1] "https://shib.data-archive.ac.uk:443/idp/Authn/UserPassword"
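To see what this request actually sends without touching the network, the URL can be built and inspected offline. This is a minimal sketch: wayf_url is a hypothetical stand-in for z$url (the real query string is elided in the question), and it assumes httr's modify_url() and parse_url() helpers, which, as far as I can tell, perform the same URL construction that GET(url, query = ...) relies on.

```r
library(httr)

# Hypothetical stand-in for z$url; the real query string is elided in the post.
wayf_url <- "https://wayf.ukfederation.org.uk/DS002/uk.ds?entityID=example"

# Field names and values taken from the answer above.
selection <- list(
  action = "selection",
  origin = "https://shib.data-archive.ac.uk/shibboleth-idp"
)

# Add the form fields to the URL's query string (values are percent-encoded).
request_url <- modify_url(wayf_url, query = selection)

# Round-trip through parse_url() to check the decoded parameters.
parsed <- parse_url(request_url)
print(parsed$query$action)
print(parsed$query$origin)
```

If the printed values match the form fields, the only remaining difference from the browser's request is session state such as cookies.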

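EDIT

It looks as though the handle pool is not keeping your session alive correctly. You therefore need to pass the handles directly rather than automatically. Also, for the POST command you need to set multipart=FALSE, as this is the default for HTML forms. The R command has a different default because it is mainly designed for uploading files. So: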

y <- GET( handle=z$handle, query = list( action="selection", origin = "https://shib.data-archive.ac.uk/shibboleth-idp") )
POST(body=values,multipart=FALSE,handle=y$handle)
Response [https://www.esds.ac.uk/]
  Status: 200
  Content-type: text/html

...snipped...    


                <title>

                        Introduction to ESDS

                </title>

                <meta name="description" content="Introduction to the ESDS, home page" />

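The three requests can be collected into one function, sketched below. It is untested against the live site and rests on several assumptions: ukds_login is a hypothetical name, the URL is passed explicitly alongside each handle (the answer's code passes only the handle), and encode = "form" is used as the newer httr spelling of multipart = FALSE. The URLs and field names come from this thread.

```r
library(httr)

# Hypothetical wrapper around the requests in this answer; not verified
# against the live service.
ukds_login <- function(username, password) {
  # Assumption: certificate verification is skipped, as in the question.
  no_verify <- config(ssl_verifypeer = 0L)

  # Step 1: the starting page redirects to the WAYF organisation picker.
  z <- GET("https://www.esds.ac.uk/secure/UKDSRegister_start.asp",
           config = no_verify)

  # Step 2: select the UK Data Archive, reusing z's handle so that the
  # session cookies carry over to the identity provider.
  y <- GET(z$url, config = no_verify, handle = z$handle,
           query = list(action = "selection",
                        origin = "https://shib.data-archive.ac.uk/shibboleth-idp"))

  # Step 3: submit the credentials as a URL-encoded form on the same session.
  POST(y$url, config = no_verify, handle = y$handle,
       body = list(j_username = username, j_password = password),
       encode = "form")
}

# Usage (needs network access and a valid account):
# resp <- ukds_login("your.username", "your.password")
# resp$url
```

Keeping the handle explicit at every step makes the session dependency visible, which is the point of the edit above.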
score 1 · Accepted Answer

I think one way to get past the "enter your organization" page goes like this:

library(tidyverse)
library(rvest)
library(stringr)
library(httr)  # provides handle_reset(), cookies() and set_cookies() used below

org <- "your_organization"
user <- "your_username"
password <- "your_password"

signin <- "http://esds.ac.uk/newRegistration/newLogin.asp"
handle_reset(signin)

# get to org page and enter org
p0 <- html_session(signin) %>% 
    follow_link("Login")
org_link <- html_nodes(p0, "option") %>% 
    str_subset(org) %>% 
    str_match('(?<=\\")[^"]*') %>%
    as.character()

f0 <- html_form(p0) %>%
    first() %>%
    set_values(origin = org_link)
fake_submit_button <- list(name = "submit-btn",
                           type = "submit",
                           value = "Continue",
                           checked = NULL,
                           disabled = NULL,
                           readonly = NULL,
                           required = FALSE)
attr(fake_submit_button, "class") <- "btn-enabled"
f0[["fields"]][["submit"]] <- fake_submit_button

c0 <- cookies(p0)$value
names(c0) <- cookies(p0)$name
p1 <- submit_form(session = p0, form = f0, config = set_cookies(.cookies = c0))

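Unfortunately, this does not solve the whole problem: (2) is harder than it looks. I have posted more of what I think is a solution here: R: use rvest (or httr) to login to a site requires cookies. Hopefully someone can help us get the rest of the way.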
