
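I have an input file that contains a single paragraph. I need to split that paragraph into two smaller paragraphs at a delimiter pattern.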

paragraph.xml

 <Text>
      This is first line.
      This is second line.
      \delemiter\new\one
      This is third line.
      This is fourth line.
 </Text>

Code:

library(XML)
doc <- xmlTreeParse("paragraph.xml")
top <- xmlRoot(doc)
## first child of the root <Text> node, i.e. its text content
text <- top[[1]]

I need to split this paragraph into two paragraphs.

Paragraph 1

 This is first line.
 This is second line.

Paragraph 2

  This is third line.
  This is fourth line.

I find the strsplit function very useful, but I could not get it to split multi-line text.


2 Answers


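Since you have an XML file, it is better to use the tools of the XML package. I see you have already started with it, so here is a continuation of what you began.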

library(XML)
doc <- xmlParse('paragraph.xml') ## equivalent to xmlTreeParse(..., useInternalNodes = TRUE)
## extract the text of the node Text
mytext <- xpathSApply(doc, '//Text/text()', xmlValue)
## convert it to a character vector of lines using scan
lines <- scan(text = mytext, sep = '\n', what = 'character')
## strip the indentation and drop empty entries so the delimiter compares exactly
lines <- trimws(lines)
lines <- lines[nzchar(lines)]
## get the delimiter index
delim <- which(lines == "\\delemiter\\new\\one")
## get the 2 paragraphs
p1 <- lines[seq(delim - 1)]
p2 <- lines[seq(delim + 1, length(lines))]

Then you can use paste or write to get the paragraph structure, for example, using write:

write(p1,"",sep='\n')

This is first line.
This is second line.
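
The text above mentions paste as an alternative to write. A minimal sketch of that variant follows; it assumes p1 holds the two lines of the first paragraph as produced earlier, and the name para1 is introduced here only for illustration:

```r
## p1 as produced by the steps above (recreated here so the snippet runs on its own)
p1 <- c("This is first line.", "This is second line.")

## collapse the lines into a single string, one line per row
para1 <- paste(p1, collapse = "\n")

## cat() prints the string with the newline rendered as an actual line break
cat(para1, "\n")
```

Unlike write, this keeps the paragraph as one character string, which may be more convenient if it is going to be processed further rather than printed.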
answered 2013-03-20T06:26:23.153

Here is a roundabout possibility, using split, grepl, and cumsum.

Some sample data:

temp <- c("This is first line.", "This is second line.", 
          "\\delimiter\\new\\one", "This is third line.", 
          "This is fourth line.", "\\delimiter\\new\\one",
          "This is fifth line")
# [1] "This is first line."   "This is second line."  "\\delimiter\\new\\one"
# [4] "This is third line."   "This is fourth line."  "\\delimiter\\new\\one"
# [7] "This is fifth line"   

Use split after generating "groups" by applying cumsum to the result of grepl:

temp1 <- split(temp, cumsum(grepl("delimiter", temp)))
temp1
# $`0`
# [1] "This is first line."  "This is second line."
# 
# $`1`
# [1] "\\delimiter\\new\\one" "This is third line."   "This is fourth line." 
# 
# $`2`
# [1] "\\delimiter\\new\\one" "This is fifth line"  
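
To see why the grouping works, it can help to look at the intermediate vectors. A small sketch, reusing the same sample data (the name is_delim is introduced here for illustration only):

```r
temp <- c("This is first line.", "This is second line.",
          "\\delimiter\\new\\one", "This is third line.",
          "This is fourth line.", "\\delimiter\\new\\one",
          "This is fifth line")

## TRUE wherever a line contains the word "delimiter"
is_delim <- grepl("delimiter", temp)
is_delim
# [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE

## the running count goes up by one at each delimiter,
## so every line gets the number of delimiters seen so far
cumsum(is_delim)
# [1] 0 0 1 1 1 2 2
```

Because the counter increases on the delimiter line itself, each delimiter ends up as the first element of the group it opens, which is what the output above shows.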

If further cleanup is desired, here is one option:

lapply(temp1, function(x) {
  x[grep("delimiter", x)] <- NA
  x[complete.cases(x)]
})
# $`0`
# [1] "This is first line."  "This is second line."
# 
# $`1`
# [1] "This is third line."  "This is fourth line."
# 
# $`2`
# [1] "This is fifth line"
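
As an alternative to cleaning up afterwards, the delimiter lines could be dropped before splitting. This is only a sketch on the same sample data; is_delim and groups are illustrative names and not part of the approach shown above:

```r
temp <- c("This is first line.", "This is second line.",
          "\\delimiter\\new\\one", "This is third line.",
          "This is fourth line.", "\\delimiter\\new\\one",
          "This is fifth line")

## mark the delimiter lines and compute the group number of every line
is_delim <- grepl("delimiter", temp, fixed = TRUE)
groups <- cumsum(is_delim)

## keep only the non-delimiter lines, together with their group numbers
result <- split(temp[!is_delim], groups[!is_delim])
result
```

This should give the same three groups as the lapply version, without the intermediate NA step.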
answered 2013-03-20T04:58:52.957