
For the entries at this link, I need to click each entry and then the Excel file link at the lower left of the page, i.e. crawl the URL of the Excel file path:


How can I do this with a web-scraping package in R, such as rvest? Sincere thanks in advance.

library(rvest)

# Start by reading an HTML page with read_html():
common_list <- read_html("http://www.csrc.gov.cn/csrc/c100121/common_list.shtml")
common_list %>%
  # extract all anchor (<a>) nodes
  rvest::html_nodes("a") %>%
  # extract the link text
  rvest::html_text() -> webtxt
# inspect
head(webtxt)

First, my question is: how do I set html_nodes correctly to get the URL of each entry's page?


Update:

> driver
$client
[1] "No sessionInfo. Client browser is mostly likely not opened."

$server
PROCESS 'file105483d2b3a.bat', running, pid 37512.
> remDr
$remoteServerAddr
[1] "localhost"

$port
[1] 4567

$browserName
[1] "chrome"

$version
[1] ""

$platform
[1] "ANY"

$javascript
[1] TRUE

$nativeEvents
[1] TRUE

$extraCapabilities
list()

When I run remDr$navigate(url), I get:

Error in checkError(res) : 
  Undefined error in httr call. httr output: length(url) == 1 is not TRUE
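
This error means that url is a character vector with more than one element: navigate() accepts exactly one URL string per call. A minimal sketch of the workaround, looping over a hypothetical vector urls one element at a time:

# navigate() takes a single URL, so visit the pages one by one
for (u in urls) {
  remDr$navigate(u)
  # ... interact with the loaded page here ...
}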

1 Answer


Use rvest to get the links:

library(rvest)
library(dplyr)
library(RSelenium)

# listing page from the question
url <- "http://www.csrc.gov.cn/csrc/c100121/common_list.shtml"

link <- url %>%
  read_html() %>%
  html_nodes('.mt10')

# the second .mt10 block holds the entry list; pull each href
# and prepend the site root to build absolute URLs
link <- link[[2]] %>%
  html_nodes("a") %>%
  html_attr('href') %>%
  paste0('http://www.csrc.gov.cn', .)

 [1] "http://www.csrc.gov.cn/csrc/c101921/c1758587/content.shtml"                         
 [2] "http://www.csrc.gov.cn/csrc/c101921/c1714636/content.shtml"                         
 [3] "http://www.csrc.gov.cn/csrc/c101921/c1664367/content.shtml"                         
 [4] "http://www.csrc.gov.cn/csrc/c101921/c1657437/content.shtml"                         
 [5] "http://www.csrc.gov.cn/csrc/c101921/c1657426/content.shtml"     

We can use RSelenium to loop over the links and download the excel files. It took about a minute for one page to load fully, so I will demonstrate with a single link.

url <- "http://www.csrc.gov.cn/csrc/c101921/c1758587/content.shtml"
# launch the browser
driver <- rsDriver(browser = c("chrome"))
remDr <- driver[["client"]]

# click on the excel file path
remDr$navigate(url)
remDr$findElement('xpath', '//*[@id="files"]/a')$clickElement()
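
To process every entry rather than a single one, here is a minimal loop sketch, assuming link is the vector of absolute URLs built above; the Sys.sleep() pause is an assumed wait for these slow-loading pages and may need tuning:

# visit each entry page and click its excel attachment
for (u in link) {
  remDr$navigate(u)
  Sys.sleep(60)  # assumed wait; the pages can take up to a minute to load
  remDr$findElement('xpath', '//*[@id="files"]/a')$clickElement()
}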
answered 2022-01-11T07:07:32.253