xml - How to scrape a secure page (https link) in R using readHTMLTable from the XML package?

Question

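There are good answers on how to use readHTMLTable from the XML package, and I have done that with regular http pages, but I cannot get it to work with https pages.

I am trying to read the table on this website (URL string):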

library(RTidyHTML)
library(XML)
url <- "https://ned.nih.gov/search/ViewDetails.aspx?NIHID=0010121048"
h = htmlParse(url)
tables <- readHTMLTable(url)

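But I get this error: File https://ned.nih.gov/search/Vi... does not exist.

I tried to get past the https problem with this (the first 2 lines below), based on a solution I found by googling (such as here: http://tonybreyal.wordpress.com/2012/01/13/r-a-quick-scrape-of-top-grossing-films-from-boxofficemojo-com/).

This trick lets me see more of the page, but every attempt to extract the table fails. Any advice is appreciated. I need table fields such as Organization, Organizational Title, and Manager.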

 #attempt to get past the https problem 
 raw <- getURL(url, followlocation = TRUE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
 head(raw)
[1] "\r\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; 
...
 h = htmlParse(raw)
Error in htmlParse(raw) : File ...
tables <- readHTMLTable(raw)
Error in htmlParse(doc) : File ...

score 28 · Accepted Answer

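The new package httr provides a wrapper around RCurl that makes it easier to scrape all kinds of pages.

Still, this page gave me a fair amount of trouble. The following works, but no doubt there are easier ways of doing it.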

library("httr")
library("XML")

# Define certificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

# Read page
page <- GET(
  "https://ned.nih.gov/", 
  path="search/ViewDetails.aspx", 
  query="NIHID=0010121048",
  config(cainfo = cafile)
)

# Use regex to extract the desired table
# (text_content() was deprecated in later httr versions in favour of content(as = "text"))
x <- content(page, as = "text")
tab <- sub('.*(<table class="grid".*?>.*</table>).*', '\\1', x)

# Parse the table
readHTMLTable(tab)

Result:

$ctl00_ContentPlaceHolder_dvPerson
                V1                                      V2
1      Legal Name:                    Dr Francis S Collins
2  Preferred Name:                      Dr Francis Collins
3          E-mail:                 francis.collins@nih.gov
4        Location: BG 1 RM 1261 CENTER DRBETHESDA MD 20814
5       Mail Stop:                                       Â
6           Phone:                            301-496-2433
7             Fax:                                       Â
8              IC:             OD (Office of the Director)
9    Organization:            Office of the Director (HNA)
10 Classification:                                Employee
11            TTY:                                       Â

httr is available here: http://cran.r-project.org/web/packages/httr/index.html

Edit: a useful page of frequently asked questions about the RCurl package: http://www.omegahat.org/RCurl/FAQ.html
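The regex step above can also be replaced by parsing the whole page text and selecting the table with XPath. The following is only a minimal sketch: the HTML string is a made-up stand-in for the downloaded page (not the real NIH markup), and the class name "grid" is taken from the regex in the answer. Passing asText = TRUE makes it explicit that the string is content rather than a file name or URL.

```r
library(XML)

# Made-up stand-in for the downloaded page text (assumption, not the real markup)
html <- '<html><body>
<table class="nav"><tr><td>Home</td></tr></table>
<table class="grid" id="person">
  <tr><td>Legal Name:</td><td>Dr Example Person</td></tr>
  <tr><td>Phone:</td><td>000-000-0000</td></tr>
</table>
</body></html>'

# asText = TRUE: treat the string as HTML content, not as a path to a file
doc <- htmlParse(html, asText = TRUE)

# Select the table by its class attribute instead of cutting it out with a regex
node <- getNodeSet(doc, "//table[@class = 'grid']")[[1]]

# Convert the single table node to a data frame without a header row
person <- readHTMLTable(node, header = FALSE, stringsAsFactors = FALSE)
print(person)
```

With a real response, the string would come from content(page, as = "text") instead of the inline literal.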

score 4 · Answer

Andrie's approach is a great way to get past https.

Below is another way to get the data, this time without readHTMLTable.

A table in HTML may have an ID. In this case the table has a convenient one, and an XPath expression passed to the getNodeSet function selects it directly.

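library("httr")
library("XML")

# Define certificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
# Read page
page <- GET(
  "https://ned.nih.gov/", 
  path="search/ViewDetails.aspx", 
  query="NIHID=0010121048",
  # Note: ssl.verifypeer = FALSE switches off certificate verification
  config(cainfo = cafile, ssl.verifypeer = FALSE)
)

# Parse the response body as text, then select the table by its id
h <- htmlParse(content(page, as = "text"), asText = TRUE)
ns <- getNodeSet(h, "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']")
ns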

I still need to extract the IDs behind the hyperlinks.

For example, instead of Colleen Barros as the manager, I need to get the ID 0010080638:

Manager:Colleen Barros
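One possible way to reach the IDs behind the links is to read the href attribute of each anchor inside the table and pull the ID out of it. This is a sketch under assumptions: the HTML fragment is hypothetical, and it assumes the link target carries the ID as an NIHID= query parameter, as the page URL itself does. The real markup may differ.

```r
library(XML)

# Hypothetical fragment; the real page markup and link format may differ
html <- '<table id="ctl00_ContentPlaceHolder_dvPerson">
  <tr><td>Manager:</td>
      <td><a href="ViewDetails.aspx?NIHID=0010080638">Colleen Barros</a></td></tr>
</table>'
doc <- htmlParse(html, asText = TRUE)

# All anchors inside the table with the known id
xp <- "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']//a"
link_text <- xpathSApply(doc, xp, xmlValue)            # visible link text
link_href <- xpathSApply(doc, xp, xmlGetAttr, "href")  # link target

# Keep only the digits that follow "NIHID=" in each href (assumed format)
ids <- sub(".*NIHID=([0-9]+).*", "\\1", link_href)

links <- data.frame(name = link_text, id = ids, stringsAsFactors = FALSE)
print(links)
```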

score 0 · Answer

This is the function I use to deal with this problem. It detects whether the URL is https and, if so, fetches the page with httr.

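readHTMLTable2 <- function(url, which = NULL, ...) {
  require(httr)
  require(XML)
  # str_detect() needs stringr, which was never loaded; base grepl() does the same job
  if (grepl("^https", url)) {
    # Fetch over https with httr, then parse the body as text
    page <- GET(url, user_agent("httr-soccer-ranking"))
    doc <- htmlParse(content(page, as = "text"), asText = TRUE)
    if (is.null(which)) {
      # No index given: return every table in the page
      tmp <- readHTMLTable(doc, ...)
    } else {
      # Pick a single table node by position
      tableNodes <- getNodeSet(doc, "//table")
      tab <- tableNodes[[which]]
      tmp <- readHTMLTable(tab, ...)
    }
  } else {
    # Plain http: readHTMLTable can read the URL directly
    tmp <- readHTMLTable(url, which = which, ...)
  }
  return(tmp)
}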
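To see what the which argument selects without touching the network, the same table-picking step can be tried on an inline document. This is only an illustrative sketch: the two-table HTML is made up, and it relies on readHTMLTable accepting which for an already parsed document as well.

```r
library(XML)

# Made-up two-table document standing in for a downloaded page
html <- '<html><body>
<table><tr><th>team</th><th>pts</th></tr><tr><td>A</td><td>3</td></tr></table>
<table><tr><th>team</th><th>pts</th></tr><tr><td>B</td><td>1</td></tr></table>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Without 'which': a list with one data frame per <table>
all_tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# With 'which = 2': only the second table
second <- readHTMLTable(doc, header = TRUE, which = 2, stringsAsFactors = FALSE)
print(second)
```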
