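I am working through the topicmodels tutorial for R. Around page 12, the authors remove HTML markup and Greek letters, but running their code gives me an error: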
R> library("XML")
R> remove_HTML_markup <- function(s) {
+ doc <- htmlTreeParse(s, asText = TRUE, trim = FALSE)
+ xmlValue(xmlRoot(doc))
+ }
R> remove_HTML_markup(JSS_papers[1,"description"])
Error: XML content does not seem to be XML, nor to identify a file name ...
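JSS_papers stores metadata for the collection of papers downloaded from the journal. The entries in the description field are the article abstracts. This one does not contain any HTML tags: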
JSS_papers[1,"description"] = "The fit of a variogram model to spatially-distributed
data is often difficult to assess. A graphical diagnostic written in S-plus is
introduced that allows the user to determine both the general quality of the fit of a
variogram model, and to find specific pairs of locations that do not have measurements
that are consonant with the fitted variogram. It can help identify nonstationarity,
outliers, and poor variogram fit in general. Simulated data sets and a set of soil
nitrogen concentration data are examined using this graphical diagnostic."
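
If the missing tags are in fact what trips the parser (an assumption on my part, not something I have confirmed), one possible workaround is to skip the parser for plain strings and guard the parse with tryCatch. A minimal sketch follows; the name remove_HTML_markup_safe and the regex fallback are hypothetical and not part of the tutorial, and the regex only strips tags, it does not decode HTML entities.

```r
# Hypothetical helper, not from the tutorial. Assumes s is a single character string.
remove_HTML_markup_safe <- function(s) {
  # No tag-like pattern found: return the text unchanged and skip the parser.
  if (!grepl("<[^>]+>", s)) {
    return(s)
  }
  tryCatch({
    # Same calls as the tutorial's remove_HTML_markup(), namespaced explicitly.
    doc <- XML::htmlTreeParse(s, asText = TRUE, trim = FALSE)
    XML::xmlValue(XML::xmlRoot(doc))
  }, error = function(e) {
    # Fallback if parsing fails (or XML is not installed): crude tag removal.
    gsub("<[^>]+>", "", s)
  })
}

# Plain text passes through untouched.
remove_HTML_markup_safe("The fit of a variogram model is often difficult to assess.")
```

With this sketch, an abstract without markup such as JSS_papers[1,"description"] would be returned as-is instead of being handed to htmlTreeParse.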