
I am using rvest to extract a section of a web page with the following code:

library('rvest')
webpage <- read_html(url("https://www.tandfonline.com/action/journalInformation?show=editorialBoard&journalCode=ceas20"))
people <- webpage %>%
  html_nodes(xpath='//*[@id="8af55cbd-03a5-4deb-9086-061d8da288d1"]/div/div/div') %>%
  html_nodes(xpath='//p')

The result is stored in an xml_nodeset named people:

> people
{xml_nodeset (11)}
 [1] <p> <b>Editors:</b> <br> Dr Xyz Anceschi - <i>University of Glasgow <a href="http://www.gla.ac.uk/schools/soci ...
 [2] <p> <b>Editorial Board:</b> <br> Dr Xyz Aliyev - <i>University of Glasgow</i> <br> Professor Richard Berry < ...
 [3] <p> <b>Board of Management:</b> <br> Professor Xyz Berry (Chair) <i>- University of Glasgow</i> <br> Profes ...
 [4] <p> <b>National Advisory Board:</b> <br> Dr Xyz Badcock <i>- University of Nottingham</i> <br> Professor Cath ...

In people, each element contains the names of interest separated by <br> tags (which are, however, unclosed: there is no </br>).

I tried to parse out each person with this code, but it does not work:

sapply(people,
    function(x)
    {
        x %>%
        html_nodes("br") %>%
        html_text()
    }
)

It only gives me a list of empty strings:

[[1]]
 [1] "" ""

[[2]]
 [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""

[[3]]
 [1] "" "" "" "" ""

[[4]]
 [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""

I assume the problem stems from the fact that the <br> nodes in the xml_nodeset are unclosed. Could that be it?

If so, what else can I do to extract each person from people?


1 Answer


You can use str_match_all to capture every name that occurs between <br> and <i>:

unlist(sapply(stringr::str_match_all(people, '<br> (.*?)\\s?-?\\s<i>'), 
              function(x) x[, 2]))

# [1] "Dr Luca Anceschi"                "Professor David J. Smith"       
# [3] "Dr Huseyn Aliyev"                "Professor Richard Berry"        
# [5] "Dr Maud Bracke"                  "Dr Eamonn Butler"               
# [7] "Dr Ammon Cheskin"                "Dr Sai Ding"                    
# [9] "Professor Jane Duckett"          "Professor Rick Fawn"  
#...
#...
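An alternative sketch that stays within rvest/xml2 instead of applying a regex to the raw HTML: serialize the node, replace each (possibly self-closing) <br> with a newline, re-parse, and split the extracted text on newlines. The sample markup below is a hypothetical stand-in mimicking the page's structure, not the actual page content.

```r
library(xml2)
library(rvest)

# Hypothetical markup with the same shape as the page:
# names separated by unclosed <br> tags inside one <p>.
html <- '<p><b>Editors:</b><br>Dr A. One - <i>Uni X</i><br>Dr B. Two - <i>Uni Y</i></p>'

p <- read_html(html) %>% html_node("p")

# Serialize the node, turn each <br> (or <br/>) into a newline, re-parse
txt <- gsub("<br\\s*/?>", "\n", as.character(p))
lines <- read_html(txt) %>% html_text() %>% strsplit("\n") %>% unlist()

people_names <- trimws(lines)
```

Each element of people_names is then one line of the original block ("Editors:", "Dr A. One - Uni X", ...), which you can filter or split further as needed.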
Answered 2020-09-20T10:23:13.700