r - Link redirection problem - Web scraping in R with rvest

Question

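When I scrape links from news sites with rvest, I often run into links that redirect to another link. In those cases I can only scrape the first link, while the second one is the page that actually contains the data. For example: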

library(dplyr)
library(rvest)
scraped.link <- "http://www1.folha.uol.com.br/folha/dinheiro/ult91u301428.shtml"

article.title <- read_html(scraped.link) %>%
      html_nodes('body') %>%
      html_nodes('.span12.page-content') %>%
      html_nodes('article') %>%
      html_nodes('header') %>%
      html_nodes('h1') %>%
      html_text()
article.title
#> character(0)

redirected.link <- "https://www1.folha.uol.com.br/mercado/2007/06/301428-banco-central-volta-a-intervir-no-mercado-para-deter-queda-do-cambio.shtml"

article.title <- read_html(redirected.link) %>%
      html_nodes('body') %>%
      html_nodes('.span12.page-content') %>%
      html_nodes('article') %>%
      html_nodes('header') %>%
      html_nodes('h1') %>%
      html_text()
article.title
#> "Banco Central volta a intervir no mercado para deter queda do câmbio"

Is there a way to get the second link from the first one? The site only keeps the first.

score 1 · Accepted Answer

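Yes. The page redirects via JavaScript `location.replace`, so you can use a regular expression to extract the first quoted item after the first instance of `location.replace` in the text of the script tags: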

library(dplyr)
library(rvest)
scraped.link <- "http://www1.folha.uol.com.br/folha/dinheiro/ult91u301428.shtml"
link.regex   <- "(.*?location[.]replace.*?\")(.*?)(\".*)"

read_html(scraped.link)      %>%
  html_nodes('script')       %>%
  html_text()                %>%
  gsub(link.regex, "\\2", .)  
#> [1] "http://www1.folha.uol.com.br/mercado/2007/06/301428-banco-central-volta-a-intervir-no-mercado-para-deter-queda-do-cambio.shtml"
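
One caveat with the `gsub` approach: any script text that does not match the pattern is returned unchanged, so a page with several script tags would give back a mix of links and raw JavaScript. Below is a minimal sketch of a stricter variant that returns only the captured target, or `NA` when nothing matches. The helper name `extract_js_redirect`, the sample script strings, and the `example.com` URL are all hypothetical, and the pattern assumes the target is passed to `location.replace` as a quoted string literal.

```r
# Hypothetical helper (not part of rvest): returns the first
# location.replace() target found in a character vector of script texts,
# or NA_character_ if none of them contains one.
extract_js_redirect <- function(script_text) {
  # Capture the quoted argument of location.replace("...") or ('...').
  pattern <- "location\\.replace\\(\\s*[\"']([^\"']+)[\"']"
  hits <- regmatches(script_text, regexec(pattern, script_text, perl = TRUE))

  # Each hit is c(full_match, captured_url), or character(0) when no match.
  targets <- vapply(
    hits,
    function(m) if (length(m) >= 2) m[2] else NA_character_,
    character(1)
  )
  targets <- targets[!is.na(targets)]

  if (length(targets) == 0) NA_character_ else targets[[1]]
}

# Made-up script contents standing in for html_text() output.
scripts <- c(
  "var tracker = 'none';",
  "location.replace(\"https://example.com/new/article.shtml\");"
)

extract_js_redirect(scripts)
extract_js_redirect("console.log('no redirect here');")
```

With a live page, the input would be the same vector the answer builds, `read_html(scraped.link) %>% html_nodes('script') %>% html_text()`. When the result is `NA`, the page did not redirect this way and the original link can be scraped directly.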
