r - Parsing HTML into text at the div level in R

Question

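library(XML)
library(xml2)  # read_html() comes from xml2 (also re-exported by rvest), not from XML
html <- read_html("https://www.sec.gov/Archives/edgar/data/1011290/000114036105007405/body.htm")
# htmlTreeParse() expects a file name or a string, so pass the serialized document as text
doc.html = htmlTreeParse(as.character(html), asText = TRUE, useInternalNodes = TRUE)
# xmlValue() returns the text of each div together with the text of all its descendants
doc.text = unlist(xpathApply(doc.html, '//div', xmlValue))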

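Because of the nested div structure, the code above reads the text twice, and I only need it once. Thank you for your time and help. For example:

doc.text[2] # contains all of the text that is repeated again in elements 3 to 59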

score 1 · Accepted Answer

Try this:

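library(rvest)
library(tidyverse)
html <- read_html("https://www.sec.gov/Archives/edgar/data/1011290/000114036105007405/body.htm")
text <- html %>%
  # select only the div elements that are direct children of <text>,
  # so nested divs are not visited a second time
  html_nodes(xpath = "//text/div") %>%
  html_text(trim = TRUE) %>%
  # join the per-div strings into a single string
  paste(collapse = " ")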
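
The likely reason for the duplication is that extracting the text of a node returns the text of all its descendants as well, so `//div` yields an outer div's full text and then each nested div's text again. Here is a minimal sketch of that effect on a made-up HTML snippet. The markup and the variable names `page`, `all_divs`, `top_divs`, and `leaf_divs` are illustrative assumptions, not taken from the SEC filing.

```r
library(rvest)  # re-exports read_html() and %>%

# Hypothetical miniature page: one outer div wrapping two inner divs.
page <- read_html(
  "<html><body><div><div>First.</div><div>Second.</div></div></body></html>"
)

# Every div at any depth: the outer div already contains the inner text,
# so the same words come back more than once.
all_divs <- page %>% html_nodes(xpath = "//div") %>% html_text(trim = TRUE)

# Only divs that are direct children of <body>: each piece of text once.
top_divs <- page %>% html_nodes(xpath = "//body/div") %>% html_text(trim = TRUE)

# Alternative: only divs that contain no other div (the innermost ones).
leaf_divs <- page %>% html_nodes(xpath = "//div[not(.//div)]") %>% html_text(trim = TRUE)

length(all_divs)  # 3 elements, with the inner text duplicated in the first
length(top_divs)  # 1 element
leaf_divs         # one element per innermost div
```

Which variant fits depends on the page: selecting the top-level divs keeps one string per outer block, while the innermost-div filter keeps the finer structure but would miss text that sits directly inside a div that also has nested divs.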
