r - Scraping an anchored website in R using the Selenium package

Question

I am fairly new to R and am having trouble extracting data from the Forbes website.

My current code is:

library(XML)  # provides readHTMLTable
url = "http://www.forbes.com/global2000/list/#page:1_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states"
data = readHTMLTable(url)

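However, the Forbes URL is anchored with a "#" symbol. To parse the data I want, I downloaded the RSelenium package, but I am not well versed in RSelenium.

Does anyone have advice on, or experience with, using RSelenium to extract data from Forbes? Ideally I would like to extract data from page 1, page 2, and so on.

Thanks!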
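
For context on the "#" in the link: the part of a URL after "#" is a fragment, which is handled on the client side and is not sent to the server as part of the request. That is a likely reason a plain readHTMLTable(url) call does not follow the page number. A minimal base R sketch of the split, where page_url is a hypothetical helper written for illustration (it is not part of any package):

```r
url <- "http://www.forbes.com/global2000/list/#page:1_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states"

# Everything before the first '#' is what an HTTP client actually requests.
request_url <- sub("#.*$", "", url)

# Everything after the first '#' is the fragment, which stays on the client.
fragment <- sub("^[^#]*#", "", url)

# Hypothetical helper: swap the page number inside the fragment.
page_url <- function(url, page) {
  sub("page:[0-9]+", paste0("page:", page), url)
}

page_url(url, 2)
```

URLs built this way only make a difference to something that runs the page's JavaScript, such as a browser driven by RSelenium.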

score 4 · Accepted Answer

Another way is to call the API that the webpage itself uses to populate the list. This downloads all 2000 companies at once.

library(httr)
library(RJSONIO)
# Endpoint the page calls to load the list
url <- "http://www.forbes.com/ajax/load_list/"
query <- list("type" = "organization",
              "uri" = "global2000",
              "year" = "2014")
response <- httr::GET(url, query = query)
# Extract the response body as a single JSON string
dat_string <- httr::content(response, as = "text")
dat_list <- RJSONIO::fromJSON(dat_string, asText = TRUE)
# Each record is a positional list; pick fields by index
df <- data.frame(rank = sapply(dat_list, "[[", 1),
                 company = sapply(dat_list, "[[", 3),
                 country = sapply(dat_list, "[[", 10),
                 sales = sapply(dat_list, "[[", 6),
                 profits = sapply(dat_list, "[[", 7),
                 assets = sapply(dat_list, "[[", 8),
                 market_value = sapply(dat_list, "[[", 9),
                 stringsAsFactors = FALSE)
# Sort by rank
df <- df[order(df$rank), ]
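
To make the sapply(dat_list, "[[", i) idiom easier to follow without hitting the network, here is a small sketch on made-up records. The company names and values are invented for illustration. The only assumption carried over from the code above is that each record is an unnamed list with the rank in position 1 and the company name in position 3.

```r
# Hypothetical records shaped like the parsed JSON: one unnamed list per
# company, with fields in fixed positions. All values are made up.
dat_list <- list(
  list(2L, "x", "Beta Corp"),
  list(1L, "y", "Alpha Inc")
)

# sapply(lst, "[[", i) calls `[[`(element, i) on every element and
# returns the i-th field of each record as a vector.
df <- data.frame(rank = sapply(dat_list, "[[", 1),
                 company = sapply(dat_list, "[[", 3),
                 stringsAsFactors = FALSE)

# Sort rows by rank, as in the code above.
df <- df[order(df$rank), ]
```

Positional indexing is brittle: if the API reorders its fields, the columns will silently be wrong, so it is worth checking a few rows against the website.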

score 1 · Accepted Answer

This is a bit hacky, but here is my solution using rvest and read.delim...

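library(rvest)

url <- "http://www.forbes.com/global2000/list/#page:1_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states"
# read_html() replaces the deprecated html(); the part after '#' is not sent to the server
a <- read_html(url) %>%
  html_nodes("#thelist") %>%
  html_text()
# Parse the node text as tab-separated lines, skipping the first 12
con <- textConnection(a)
df <- read.delim(con, sep = "\t", header = FALSE, skip = 12, stringsAsFactors = FALSE)
close(con)
# Fill empty V1 cells from V3, then drop the helper columns and blank rows
df$V1[df$V1 == ""] <- df$V3[df$V1 == ""]
df$V2 <- df$V3 <- NULL
df <- subset(df, V1 != "")
# Values now repeat in groups of six per company; the row index modulo 6 selects the column
df$index <- 1:nrow(df)
df2 <- data.frame(company = df$V1[df$index %% 6 == 1],
                  country = df$V1[df$index %% 6 == 2],
                  sales = df$V1[df$index %% 6 == 3],
                  profits = df$V1[df$index %% 6 == 4],
                  assets = df$V1[df$index %% 6 == 5],
                  market_value = df$V1[df$index %% 6 == 0])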
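
The modulo-6 step above can be checked on a toy vector. The sketch below uses made-up cell values (not real Forbes data) and compares the modulo selection with an alternative reshaping through matrix(..., byrow = TRUE). Both rest on the same assumption, namely that the cleaned text has exactly six entries per company.

```r
# Made-up cell values in the order the cleaned text would yield them:
# six consecutive entries per company.
cells <- c("Alpha Inc", "US", "10", "1", "50", "20",
           "Beta Corp", "UK", "8", "2", "40", "15")
index <- seq_along(cells)

# Same selection rule as above: position modulo 6 picks the column.
by_modulo <- data.frame(company = cells[index %% 6 == 1],
                        country = cells[index %% 6 == 2],
                        market_value = cells[index %% 6 == 0],
                        stringsAsFactors = FALSE)

# Alternative reshaping: fill a 6-column matrix row by row.
m <- matrix(cells, ncol = 6, byrow = TRUE)
by_matrix <- data.frame(company = m[, 1],
                        country = m[, 2],
                        market_value = m[, 6],
                        stringsAsFactors = FALSE)
```

If the number of cells is not a multiple of six (for example because a company has a missing value), both versions will misalign, so it is worth checking length(cells) %% 6 == 0 first.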
