r - How to select the "href" of a web page for a specific "target"?

Question

<a class="image teaser-image ng-star-inserted" target="_self" href="/politik/inland/neuwahlen-2022-welche-szenarien-jetzt-realistisch-sind/401773131">

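I just want to extract the "href" (for example, from the HTML tag above) so that I can join it with the site's domain, "https://kurier.at", and scrape all the articles on the homepage.

I tried the following code: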

library(rvest)
library(lubridate)


kurier_wbpg <- read_html("https://kurier.at")

# I just want the "a" tags which come with the attribute "_self" 

articleLinks <- kurier_wbpg %>% html_elements("a")%>%
html_elements(css = "tag[attribute=_self]")  %>% 
html_attr("href")%>% 
paste("https://kurier.at",.,sep = "")

When I run the code block above up to the html_attr("href") step, the result I get is

character(0)

I think there is a problem with how I am selecting the HTML element. Could someone help?

score 1 · Accepted Answer

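You need to narrow the CSS selector down to the image of the second teaser block, which you can do through the naming convention of the classes. You can use url_absolute() to add the domain.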

library(rvest)
library(magrittr)

url <- 'https://kurier.at/'
result <- read_html(url) %>% 
  html_element('.teasers-2 .image') %>% 
  html_attr('href') %>% 
  url_absolute(url)
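To illustrate why url_absolute() is preferable to pasting the domain in front of each href, here is a minimal sketch. It runs on a small inline HTML snippet whose markup and links are made up for illustration; it is not the real kurier.at page.

```r
library(rvest)

# Hypothetical markup standing in for one teaser block (an assumption,
# not copied from kurier.at); one link is relative, one is already absolute
page <- read_html('<div class="teasers-2">
<a class="image teaser-image" target="_self" href="/politik/inland/example-article/1">A</a>
<a class="image teaser-image" target="_self" href="https://kurier.at/sport/example-article/2">B</a>
</div>')

hrefs <- page %>%
  html_elements(".teasers-2 .image") %>%
  html_attr("href")

# url_absolute() resolves relative links against the base URL
# and leaves links that are already absolute unchanged
absolute <- url_absolute(hrefs, "https://kurier.at/")

# Plain pasting prepends the domain to every href,
# including the one that was already absolute
pasted <- paste0("https://kurier.at", hrefs)

print(absolute)
print(pasted)
```

With pasting, the second link ends up with the domain twice, so url_absolute() is the safer choice whenever a page mixes relative and absolute links.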

The same principle applies to getting all the teasers:

results <- read_html(url) %>% 
  html_elements('.teaser .image') %>% 
  html_attr('href') %>% 
  url_absolute(url)
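The only change between the two snippets above, apart from the class, is html_element() versus html_elements(). A small sketch of the difference, again on made-up inline HTML (the class names mirror the ones used above, but the markup itself is an assumption):

```r
library(rvest)

# Hypothetical page with two teaser blocks
page <- read_html('<div class="teaser"><a class="image" href="/a">A</a></div>
<div class="teaser"><a class="image" href="/b">B</a></div>')

# html_element() returns only the first match
first_link <- page %>%
  html_element(".teaser .image") %>%
  html_attr("href")

# html_elements() returns every match
all_links <- page %>%
  html_elements(".teaser .image") %>%
  html_attr("href")

print(first_link)
print(all_links)
```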

I am not sure whether you also want to include the bottom block of 5. If so, you can again use classes:

articles <- read_html(url) %>% 
  html_elements('.teaser-title') %>% 
  html_attr('href') %>% 
  url_absolute(url)

score 0 · Accepted Answer

It works with XPath:

library(rvest)

kurier_wbpg <- read_html("https://kurier.at")

# A leading // in XPath searches the whole document, so a single call
# restricted to <a> elements is enough
articleLinks <- kurier_wbpg %>%
  html_elements(xpath = '//a[@target="_self"]') %>%
  html_attr('href') %>%
  paste0("https://kurier.at", .)

articleLinks

# [1] "https://kurier.at/plus"
# [2] "https://kurier.at/coronavirus"
# [3] "https://kurier.at/politik"
# [4] "https://kurier.at/politik/inland"
# [5] "https://kurier.at/politik/ausland"
#...
#...
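For completeness, the same selection can be written as a CSS attribute selector. The selector in the question, tag[attribute=_self], is a literal placeholder: it looks for elements named "tag" that carry an attribute named "attribute", which is why it returned character(0). A minimal sketch on made-up inline HTML (the markup is an assumption, not the real page):

```r
library(rvest)

# Hypothetical navigation block with three kinds of links
page <- read_html('<nav>
<a target="_self" href="/politik">Politik</a>
<a target="_blank" href="https://example.org/">External</a>
<a href="/sport">Sport</a>
</nav>')

# a[target="_self"] keeps only <a> elements whose target attribute is "_self"
self_links <- page %>%
  html_elements('a[target="_self"]') %>%
  html_attr("href") %>%
  url_absolute("https://kurier.at/")

print(self_links)
```

On the real page this would be applied to read_html("https://kurier.at") instead of the inline snippet.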
