xml - A simple R-specific XML parser

Question

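I have read the SO questions about why you should never use regular expressions on {HT,X}ML, such as this one - Regex to Indent an XML File, but I thought I would post a function I wrote that does nothing except indent XML lines according to their nesting level.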

To comply with SO's guidelines, I'll phrase my solution as a question :-) , so -

What will go wrong when I start using this function to format XML files, sent to me by unnamed bad guys, that have no indentation at all?

xmlit <- function(x, indchar = '\t') {
  # Require x to be a vector of character strings, one
  # per line of the XML file.
  # Add an indent for every line below one matching "<[^/?!]" and
  # remove an indent for every line below "</".

  indit <- ''
  y <- vector('character', length(x))
  for (j in seq_along(x)) {
    # first add whatever indent we're up to
    y[j] <- paste(indit, x[j], collapse = '', sep = '')
    # check for openers: '<' but not '</' or '/>'
    if (grepl('<[^/?!]', x[j]) && !grepl('/>', x[j]) && !grepl('</', x[j])) {
      indit <- paste(indit, indchar, collapse = '', sep = '')
    } else {
      # check for closers: '</'
      if (grepl('<[/]', x[j]) && !grepl('<[^/?!]', x[j])) {
        # move existing line back out one indent
        # (drop one indchar's worth of characters, whatever its width)
        y[j] <- substr(y[j], nchar(indchar) + 1, nchar(y[j]))
        indit <- substr(indit, nchar(indchar) + 1, nchar(indit))
      }
    }
  }
  # Note that I'm depending on every level to have a matching closer,
  # and that in particular the very last line is a closer.
  return(invisible(y))
}

score 0 · Accepted Answer

There is also the assumption that any opening tag must be the first thing on its line. If it is not, there is a problem:

> cat(xmlit(c("<begin>","<foo/><begin>","</begin>","</begin>")), sep="\n")
<begin>
        <foo/><begin>
</begin>
/begin>
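
One way to see where the output above goes wrong is to compare the depth change each line really causes with the depth change that xmlit's three grepl() tests infer. The sketch below is only an illustration: `tag_delta` and `xmlit_delta` are hypothetical helper names that do not appear in the original post, and the tag-counting patterns are themselves regex-based, so they assume no `>` inside attribute values, comments, or CDATA sections.

```r
# Hypothetical helpers (not from the original post).

# Net nesting change per line, obtained by counting tags with regexes.
tag_delta <- function(x) {
  count <- function(pattern) {
    m <- gregexpr(pattern, x, perl = TRUE)
    # gregexpr() returns -1 when there is no match, so count positive hits only
    vapply(m, function(hit) sum(hit > 0), integer(1))
  }
  openers     <- count("<[^/?!][^>]*>")   # <tag ...>, self-closing included
  self_closed <- count("<[^/?!][^>]*/>")  # <tag ... />
  closers     <- count("</[^>]*>")        # </tag>
  openers - self_closed - closers
}

# Net nesting change per line as xmlit's conditions would classify it.
xmlit_delta <- function(x) {
  has_open  <- grepl("<[^/?!]", x)
  is_opener <- has_open & !grepl("/>", x) & !grepl("</", x)
  is_closer <- grepl("</", x) & !has_open
  as.integer(is_opener) - as.integer(is_closer)
}

x <- c("<begin>", "<foo/><begin>", "</begin>", "</begin>")
which(tag_delta(x) != xmlit_delta(x))  # 2: only the second line is misread
```

Any line flagged this way breaks the one-tag-per-line assumption, so the indentation of everything after it should not be trusted.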

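For some subsets of XML, with enough (additional) assumptions about structure, regular expressions can work. But if those assumptions are violated, well, that is why there are parsers.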
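
As a rough sketch of the parser route, the input can be handed to an XML parser and serialized again with the parser's own formatting. This assumes the xml2 package is installed; the original post does not name any package, and `reindent_with_parser` is a hypothetical helper. The indentation is whatever libxml2 produces rather than a configurable `indchar`, empty elements may come back in self-closing form, and input that is not well-formed raises an error instead of being silently mis-indented.

```r
# Hypothetical helper (not from the original post); assumes the xml2 package.
reindent_with_parser <- function(x) {
  if (!requireNamespace("xml2", quietly = TRUE)) {
    stop("this sketch needs the xml2 package")
  }
  # Parse the whole document at once, so line breaks in the input no longer matter.
  doc <- xml2::read_xml(paste(x, collapse = "\n"))
  # as.character() serializes the document with formatting by default;
  # split the result back into one string per line.
  strsplit(as.character(doc), "\n", fixed = TRUE)[[1]]
}

# Only run the demonstration when xml2 is actually available.
if (requireNamespace("xml2", quietly = TRUE)) {
  cat(reindent_with_parser(c("<begin>", "<foo/><begin>", "</begin>", "</begin>")),
      sep = "\n")
}
```

Because the parser works on the document tree and not on individual lines, two tags on one line are no longer a special case.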
