我想使用 R 从这里抓取维基媒体类别树中包含的链接和树的结构。下面的代码可以打开所有可折叠的要点
library(RSelenium)
rD <- rsDriver(check = FALSE)
remDr <- rD[["client"]]
remDr$navigate("https://commons.wikimedia.org/wiki/Category:Sports")
n <- 1
# n <- 10 # takes a long time to expand all bullet points
for(i in 1:n){
b <- remDr$findElements(using = "css selector", "[title='expand']")
for(i in 1:length(b)){
b[[i]]$clickElement()
}
}
...但我正在努力建立一个看起来像...的数据库
我可以使用下面的代码获取项目符号的 href 和名称,但我正在努力寻找一种方法来指示每个项目符号点所指的级别(即每个项目符号点在类别树中的深度)?我在想可能有一个聪明的 xpath 方法来计算CategoryTreeChildren
每个子弹的深度,但这远远超出了我的能力。
# for testing I manually expand the bullets for the first couple of branches
# (fully for Bulgaria women badminton, basketball) and the last possible
# branch rather than let the for loop run and run through multiple cycles.
library(tidyverse)
library(rvest)
s <- remDr$getPageSource()
d <- read_html(s[[1]]) %>%
html_nodes("div#mw-subcategories") %>%
html_nodes("div.CategoryTreeItem") %>%
html_nodes("a") %>%
map(xml_attrs) %>%
map_df(~as.list(.)) %>%
as_tibble()
# > d
# # A tibble: 135 x 2
# href title
# <chr> <chr>
# 1 /wiki/Category:Categories_by_sport Category:Categories by sport
# 2 /wiki/Category:Categories_by_sport_by_c~ Category:Categories by sport by co~
# 3 /wiki/Category:Categories_of_Bulgaria_b~ Category:Categories of Bulgaria by~
# 4 /wiki/Category:Female_sportspeople_from~ Category:Female sportspeople from ~
# 5 /wiki/Category:Female_badminton_players~ Category:Female badminton players ~
# 6 /wiki/Category:Maria_Delcheva Category:Maria Delcheva
# 7 /wiki/Category:Petya_Nedelcheva Category:Petya Nedelcheva
# 8 /wiki/Category:Gabriela_Stoeva Category:Gabriela Stoeva
# 9 /wiki/Category:Stefani_Stoeva Category:Stefani Stoeva
# 10 /wiki/Category:Women%27s_basketball_pla~ Category:Women's basketball player~
我也玩过WikipediR包 - 它在包描述中说它可用于检索类别树的元素,但我找不到如何实现它的示例。