r - R: extracting innerHTML with rvest

Question

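Using rvest in R to scrape a web page, I'd like to extract the equivalent of innerHTML from a node, in particular so that I can change line breaks (<br>) into newlines before applying html_text.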

Example of the desired functionality:

library(rvest)
doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')
innerHTML(doc, ".pp")

which should produce the following output:

[1] "<p class=\"pp\">First Line<br>Second Line</p>"

With rvest 0.2 this can be achieved through toString.XMLNode:

# run under rvest 0.2
library(XML)
html('<html><p class="pp">First Line<br />Second Line</p>') %>% 
  html_node(".pp") %>% 
  toString.XMLNode
[1] "<p class=\"pp\">First Line<br>Second Line</p>"

With the newer rvest 0.2.0.900 this no longer works:

# run under rvest 0.2.0.900
library(XML)
html_node(doc,".pp") %>% 
  toString.XMLNode
[1] "{xml_node}\n<p>\n[1] <br/>"

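The desired functionality is essentially available in the write_xml function of the xml2 package, on which rvest now depends, if only write_xml could return its output to a variable instead of insisting on writing to a file. (A textConnection is not accepted either.)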

As a workaround, I can temporarily write to a file:

# extract innerHTML, workaround: write/read to/from temp file
html_innerHTML <- function(x, css, xpath) {
  file <- tempfile()
  html_node(x,css) %>% write_xml(file)
  txt <- readLines(file, warn=FALSE)
  unlink(file)
  txt
}
html_innerHTML(doc, ".pp") 
[1] "<p class=\"pp\">First Line<br>Second Line</p>"

With this I can, for example, convert line-break tags into newline characters:

html_innerHTML(doc, ".pp") %>% 
  gsub("<br\\s*/?\\s*>","\n", .) %>%
  read_html %>%
  html_text
[1] "First Line\nSecond Line"

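Is there a better way to do this with existing functions from, for example, rvest, xml2, XML, or other packages? In particular, I'd like to avoid writing to the hard drive.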

score 2 · Accepted Answer

As @r2evans noted, as.character(doc) is the solution.

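Regarding your last code snippet, which extracts the <br>-separated text from the node while converting each <br> to a newline: there is a workaround in the currently unresolved rvest issue #175, comment #2.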

A simplified version for this question:

doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')

# r2evans's solution:
as.character(rvest::html_node(doc, xpath="//p"))
##[1] "<p class=\"pp\">First Line<br>Second Line</p>"

# rentrop@github's solution, simplified:
innerHTML <- function(x, trim = FALSE, collapse = "\n"){
    paste(xml2::xml_find_all(x, ".//text()"), collapse = collapse)
}
innerHTML(doc)
## [1] "First Line\nSecond Line"
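For completeness, the temp-file workaround from the question can probably be dropped entirely by combining as.character with the same gsub step. This is only a minimal sketch that reuses the question's own regex and example document; the variable names html_string and plain_text are made up for illustration:

```r
library(xml2)
library(rvest)

doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')

# Serialize the selected node to a string in memory (no temp file involved)
html_string <- as.character(html_node(doc, ".pp"))

# Replace <br> variants with newlines, re-parse the string, then extract the text
plain_text <- html_text(read_html(gsub("<br\\s*/?\\s*>", "\n", html_string)))
plain_text
```

Re-parsing the modified string is a bit wasteful, but it keeps everything in memory and stays close to the pipeline shown in the question.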

score -1 · Accepted Answer

Here is a solution using rvest 0.3.5:

doc <- xml2::read_html('<html><p class="pp">First Line<br />Second Line</p>')

nodes <- rvest::html_nodes(doc, css = '.pp')
# {xml_nodeset (1)}
# [1] <p class="pp">First Line<br>Second Line</p>

rvest::html_text(nodes)
# [1] "First LineSecond Line"
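Note that html_text drops the <br> here, so the two lines run together. If a newer rvest is available (assuming version 1.0.0 or later, which is not what this answer was written against), html_text2 is documented to convert <br> into "\n", so something along these lines may be all that is needed:

```r
library(rvest)

doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')

# html_text2() (rvest >= 1.0.0) approximates browser rendering,
# which includes turning <br> into a newline
node <- html_element(doc, ".pp")
rendered_text <- html_text2(node)
rendered_text
```

html_element is the newer name for html_node in those versions; rendered_text is just an illustrative variable name.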
