r - R crashing on readHTMLTable call to Wikipedia

Question

I am trying to scrape a Wikipedia page, something I have done many times before:

library(XML)
myURL <- "http://en.wikipedia.org/wiki/List_of_US_Open_MenUs_Singles_champions"
y <- readHTMLTable(myURL,  stringsAsFactors = FALSE)

R crashes, both in RStudio and in the standard GUI.

Other SO comments on similar problems suggest using readLines:

u=url(myURL)
readLines(u) #  cannot open: HTTP status was '404 Not Found'

That URL is actually redirected, so I entered the final URL instead:

myURL <- "http://en.wikipedia.org/wiki/List_of_US_Open_Men%27s_Singles_champions"

This time readLines does output the page, but the XML functions, including htmlParse, still cause a crash.

TIA

score 3 · Accepted Answer

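I have found the httr package invaluable for solving web scraping problems of any kind. In this case you need to supply a user agent, because Wikipedia blocks the content if you don't: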

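library(httr)
library(XML)
myURL <- "http://en.wikipedia.org/wiki/List_of_US_Open_Men%27s_Singles_champions"
# Send an explicit user agent with the request
page <- GET(myURL, user_agent("httr"))
# The older text_content(page) is deprecated in favour of content(page, as = "text")
x <- readHTMLTable(content(page, as = "text"), as.data.frame = TRUE)
head(x[[1]])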

Which produces this:

  US Open Men's Singles Champions                                                          NA
1                Official website                                                        <NA>
2                        Location                        Queens – New York City United States
3                           Venue                USTA Billie Jean King National Tennis Center
4                  Governing body                                                        USTA
5                         Created 1881 (established)Open Era: 1968\n(44 editions, until 2011)
6                         Surface  Grass (1881–1974)HarTru (1975–1977)DecoTurf (1978–Present)
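
If you scrape more than one page, it may help to wrap the steps above in a small function that stops on an HTTP error before the parser ever sees the response. The following is only a minimal sketch: the names parse_tables and fetch_tables are hypothetical helpers made up for illustration, the default agent string is an assumption, and I have not confirmed that this avoids the crash on every setup.

```r
library(httr)
library(XML)

# Hypothetical helper: parse an HTML string and return every <table>
# in it as a data frame (a list with one element per table).
parse_tables <- function(html) {
  doc <- htmlParse(html, asText = TRUE)   # asText = TRUE: treat input as markup, not a file name
  readHTMLTable(doc, stringsAsFactors = FALSE)
}

# Hypothetical helper: download a page with an explicit user agent,
# fail loudly on HTTP errors such as 404, then hand the body to parse_tables().
fetch_tables <- function(url, agent = "httr") {
  resp <- GET(url, user_agent(agent))
  stop_for_status(resp)                   # converts 4xx/5xx responses into R errors
  parse_tables(content(resp, as = "text", encoding = "UTF-8"))
}

# Usage (needs network access, so left commented out):
# tables <- fetch_tables("http://en.wikipedia.org/wiki/List_of_US_Open_Men%27s_Singles_champions")
# head(tables[[1]])
```

Keeping the download and the parsing separate means you can check the parsing step on a small HTML string without touching the network.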
