html - How to read and parse the contents of a webpage in R

Question

I would like to read the contents of a URL (e.g., http://www.haaretz.com/) in R. How can I do that?

score 34 · Accepted Answer

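I'm not sure how you want to process that page, because it's really messy. As we re-learned in this famous Stack Overflow question, running regular expressions over HTML is not a good idea, so you will definitely want to parse it with the XML package.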

Here's an example to get you started:

require(RCurl)
require(XML)
webpage <- getURL("http://www.haaretz.com/")
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
# parse the tree by tables
x <- xpathSApply(pagetree, "//*/table", xmlValue)  
# do some clean up with regular expressions
x <- unlist(strsplit(x, "\n"))
x <- gsub("\t","",x)
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)
x <- x[!(x %in% c("", "|"))]

This results in a character vector that is mostly the text of the webpage (along with some JavaScript):

> head(x)
[1] "Subscribe to Print Edition"              "Fri., December 04, 2009 Kislev 17, 5770" "Israel Time:Â 16:48Â (EST+7)"           
[4] "Â Â Make Haaretz your homepage"          "/*check the search form*/"               "function chkSearch()"
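
The output above still contains JavaScript because xmlValue collects all text under each table, including the contents of script tags. One possible way to avoid that, sketched here under assumptions, is to select only text nodes that have no script or style ancestor. The inline HTML string is a made-up stand-in for the downloaded page (not the real Haaretz markup), and trimws() assumes R 3.2.0 or later.

```r
library(XML)

# Hypothetical stand-in for a downloaded page; not the real Haaretz markup.
html <- paste0(
  "<html><head><style>p { color: red; }</style></head><body>",
  "<table><tr><td>Subscribe to Print Edition</td></tr></table>",
  "<script>function chkSearch() { return true; }</script>",
  "<p>  Make this your homepage  </p>",
  "</body></html>"
)

# Parse the string directly instead of fetching a URL.
doc <- htmlParse(html, asText = TRUE)

# Keep text nodes under <body> that are not inside <script> or <style>.
txt <- xpathSApply(
  doc,
  "//body//text()[not(ancestor::script) and not(ancestor::style)]",
  xmlValue
)

# Trim surrounding whitespace and drop strings that end up empty.
txt <- trimws(txt)
txt <- txt[nzchar(txt)]
print(txt)
```

On a real page the same XPath expression could be applied to the pagetree object from the answer, although inline event handlers and noscript blocks may still need separate handling.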

score 3 · Accepted Answer

Your best bet may be the XML package - see, for example, this previous question.
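
As a rough illustration of what the XML package offers, the sketch below pulls link targets and a table out of a small page. The HTML string, its links, and its table contents are invented for the example so that it runs without network access; only htmlParse, xpathSApply, xmlGetAttr, xmlValue, and readHTMLTable come from the package.

```r
library(XML)

# Hypothetical inline page; links and table contents are invented.
html <- paste0(
  "<html><body>",
  "<a href=\"/news\">News</a>",
  "<a href=\"/opinion\">Opinion</a>",
  "<table>",
  "<tr><th>Section</th><th>Articles</th></tr>",
  "<tr><td>News</td><td>12</td></tr>",
  "<tr><td>Opinion</td><td>5</td></tr>",
  "</table>",
  "</body></html>"
)

doc <- htmlParse(html, asText = TRUE)

# Link targets and link text, selected with XPath.
hrefs  <- xpathSApply(doc, "//a", xmlGetAttr, "href")
labels <- xpathSApply(doc, "//a", xmlValue)
print(data.frame(label = labels, href = hrefs))

# readHTMLTable() returns a list with one data frame per <table>.
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)
print(tables[[1]])
```

For a live page, the string passed to htmlParse would be replaced by the downloaded content, for example the result of getURL() from RCurl.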

answered 2009-12-04T04:29:35.223

score 2 · Accepted Answer

I know you asked for R, but maybe Python with BeautifulSoup is the way forward here? You could scrape the page with BeautifulSoup and then do your analysis in R.
