r - Scraping in R specifically using the rvest and glue packages

Question

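I am trying to scrape several pages of sports data using the rvest and glue packages. I am having trouble with the nesting, and I think it is because the tables on the site have a two-row header (some headers span one row, others two). Here is the code I started with. I checked to make sure the site allows scraping using Python, and all is good there.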

library(tidyverse) 
library(rvest) # interacting with html and webcontent
library(glue)

Web page: https://fantasy.nfl.com/research/scoringleaders?position=1&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek=1

Function to scrape the selected weeks 1:17 and positions 1:4:

salary_scrape_19 <- function(week, position) {

  Sys.sleep(3)  # pause between requests

  cat(".")      # progress indicator

  url <- glue("https://fantasy.nfl.com/research/scoringleaders?position={position}&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek={week}")
  read_html(url) %>% 
    html_nodes("table") %>% 
    html_table() %>%
    purrr::flatten_df()
    # set_names(need to clean headers before I can set this)
}

scraped_df <- scaffold %>% 
mutate(data = map2(week, position, ~salary_scrape_19(.x, .y))) 

scraped_df

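Ultimately, I want to build a scraping function that gets all positions (QB, RB, WR, and TE) with the same columns for every week of 2019. (Eventually I would like to add a third variable to glue in {year}, but I need to get this working first.)

Again, I think the problem comes from the inconsistent table headers on the site: some are one row, while others are two rows.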

score 0 · Accepted Answer

We can paste the first row onto the original column names and then remove that row.

library(tidyverse)
library(rvest)

salary_scrape_19 <- function(week, position) {

  url <- glue::glue("https://fantasy.nfl.com/research/scoringleaders?position={position}&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek={week}")
  read_html(url) %>% 
    html_nodes("table") %>% 
    html_table() %>%
    .[[1]] %>%                                   # keep the first table on the page
    set_names(paste0(names(.), .[1, ])) %>%      # append the first row to the column names
    slice(-1)                                    # drop the row that held the second header
}
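
To see what the header fix does without requesting the page, here is a minimal offline sketch. The `raw` tibble is a made-up stand-in for what `html_table()` might return for a table with a two-row header, and its column names and values are assumptions for illustration, not data from the site.

library(tibble)
library(dplyr)
library(purrr)

# Hypothetical toy table: the second header row has landed in the first data row.
raw <- tibble(
  Player  = c("", "Player A", "Player B"),
  Passing = c("Yds", "378", "324"),
  Rushing = c("Yds", "2", "6")
)

# Paste the first row onto the existing column names.
merged_names <- paste0(names(raw), unlist(raw[1, ]))

clean <- raw %>%
  set_names(merged_names) %>%  # "Player", "PassingYds", "RushingYds"
  slice(-1)                    # drop the row that was really a header

names(clean)
nrow(clean)

Because the stat columns are read as character once a header row is mixed in, a call such as `readr::type_convert()` could be applied afterwards if numeric columns are needed.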

We can then use map2 to scrape the data for different values of week and position.

Trying it on sample data:

scaffold <- data.frame(week = c(1, 2), position = c(1, 2))
scraped_df <- scaffold %>% mutate(data = map2(week, position, salary_scrape_19))
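
To cover every week and position from the question, the scaffold could be built with `tidyr::crossing()` and the list column flattened with `unnest()`. The sketch below is only an outline under assumptions: `fake_scrape()` is a hypothetical offline stand-in for `salary_scrape_19()` so that the example runs without hitting the site, and its columns are invented.

library(dplyr)
library(purrr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for salary_scrape_19(); returns the same toy table every time.
fake_scrape <- function(week, position) {
  tibble(Player = c("Player A", "Player B"), Pts = c("20.1", "18.4"))
}

# Every combination of week 1:17 and position 1:4 (68 rows).
scaffold <- crossing(week = 1:17, position = 1:4)

scraped_df <- scaffold %>%
  slice_head(n = 4) %>%                                  # small subset while testing
  mutate(data = map2(week, position, fake_scrape)) %>%  # one table per row
  unnest(data)                                           # flatten into one data frame

scraped_df

With the real scraper, `unnest()` will only combine the tables cleanly if each position's table has compatible column names and types, so the headers may need further cleaning per position first.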
