r - Finding links from nested links in R

Source: https://stackoverflow.com/questions/61207243 2020-04-14T12:04:36.413


I am learning text mining with R. I am trying to find all the links in an HTML document.

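I tried getHTMLLinks() from the XML package, but it returns an empty result with the following warning:

library(XML) # provides getHTMLLinks()
url = "https://elections.maryland.gov/elections/2012/election_data/index.html"
getHTMLLinks(url)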

character(0)
Warning message:
XML content does not seem to be XML: 'https://elections.maryland.gov/elections/2012/election_data/index.html'
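
A likely cause of this warning (an assumption, not something confirmed in the question) is that the XML package's parser cannot fetch `https://` URLs itself, so it ends up treating the URL string as the content. A common workaround is to download the page text with an HTTPS-capable client and hand the text to the parser. The sketch below skips the download and uses a small hypothetical HTML string, `page_text`, as a stand-in for the downloaded page:

```r
library(XML)

# Stand-in for page source fetched by an HTTPS-capable client
# (hypothetical content, not the real Maryland page).
page_text <- '<html><body>
  <a href="Allegany_County_2012_General.csv">Allegany</a>
  <a href="Anne_Arundel_County_2012_General.csv">Anne Arundel</a>
</body></html>'

# asText = TRUE tells the parser the argument is markup, not a file name or URL.
doc <- htmlParse(page_text, asText = TRUE)

# Collect the href attribute of every <a> element.
hrefs <- xpathSApply(doc, "//a", xmlGetAttr, "href")
print(hrefs)
```

In real use, `page_text` would come from whatever download function is available; the parsing step stays the same.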

So I tried the "rvest" package to find the links. The code is as follows:

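library(rvest) # provides %>%, html_nodes() and html_attr()

links = xml2::read_html(url) %>% # read the HTML page
  html_nodes("a") %>% # select all <a> nodes
  html_attr("href") %>% # extract the href attribute of each <a> node
  .[grep("general.csv", ., ignore.case = TRUE)] # keep only hrefs matching "general.csv"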

It returns all the links as a character vector.

head(links)

[1] "State_Congressional_Districts_2012_General.csv" "State_Legislative_Districts_2012_General.csv"
[3] "All_By_Precinct_2012_General.csv"               "Allegany_County_2012_General.csv"
[5] "Allegany_By_Precinct_2012_General.csv"          "Anne_Arundel_County_2012_General.csv"

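All of these are just the file names listed in the href attributes, not complete URLs. On the page, however, they are hyperlinks to tables.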

It would be great if someone could help me extract the final links rather than just the names in these hyperlinks.
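
One way to approach this, assuming the href values are relative to the page they were scraped from, is to resolve them against the page URL with `xml2::url_absolute()`. The sketch below uses two of the names from the output above; `base_url` and `full_links` are illustrative names, and no download is performed:

```r
library(xml2)

# URL of the page the hrefs were scraped from (taken from the question).
base_url <- "https://elections.maryland.gov/elections/2012/election_data/index.html"

# Two of the relative hrefs shown in the question's output.
links <- c(
  "State_Congressional_Districts_2012_General.csv",
  "All_By_Precinct_2012_General.csv"
)

# Resolve each relative href against the page URL to get a full link.
full_links <- url_absolute(links, base_url)
print(full_links)
```

If the resolved URLs are correct for this site, each element of `full_links` could then be passed to a reader such as `read.csv()`. Whether the server actually serves the files at those addresses is not verified here.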
