
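I am trying to use the stringi package to split on a delimiter (the delimiter may be repeated) while keeping the delimiter. This is similar to a question I asked earlier: R split on delimiter (split) keep the delimiter (split) but the delimiter can be repeating. I don't think base strsplit can handle this type of regex. The stringi package can, but I can't figure out how to write the regex so that it splits on the delimiter when there are repeats and does not leave an empty string at the end of the string.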

Base R, stringr, stringi, and other solutions are all welcome.

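The latter problem (the trailing empty string) occurs because I use the greedy * on \\s, but a space is not guaranteed after the delimiter, so the best I could come up with was to leave it in: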

MWE

text.var <- c("I want to split here.But also||Why?",
   "See! Split at end but no empty.",
   "a third string.  It has two sentences"
)

library(stringi)   
stri_split_regex(text.var, "(?<=([?.!|]{1,10}))\\s*")

# Outcome

## [[1]]
## [1] "I want to split here." "But also|"     "|"          "Why?"                 
## [5] ""                     
## 
## [[2]]
## [1] "See!"       "Split at end but no empty." ""                          
## 
## [[3]]
## [1] "a third string."      "It has two sentences"

# Desired outcome

## [[1]]
## [1] "I want to split here." "But also||"                     "Why?"                                  
## 
## [[2]]
## [1] "See!"         "Split at end but no empty."                         
## 
## [[3]]
## [1] "a third string."      "It has two sentences"

2 Answers


Using strsplit:

 strsplit(text.var, "(?<=[.!|])( +|\\b)", perl=TRUE)
 #[[1]]
 #[1] "I want to split here." "But also||"            "Why?"                 

 #[[2]]
 #[1] "See!"                       "Split at end but no empty."

 #[[3]]
 #[1] "a third string."      "It has two sentences"

Or:

 library(stringi)
 stri_split_regex(text.var, "(?<=[.!|])( +|\\b)")
 #[[1]]
 #[1] "I want to split here." "But also||"            "Why?"                 

 #[[2]]
 #[1] "See!"                       "Split at end but no empty."

 #[[3]]
 #[1] "a third string."      "It has two sentences"
answered 2014-10-22T15:39:52.157

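Just use a pattern that looks for inter-character positions that (1) are preceded by one of ?.!| and (2) are not followed by one of ?.!|. Tack on \\s* to match and consume any number of consecutive whitespace characters, and you're good to go.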

##                  (look-behind)(look-ahead)(spaces)
strsplit(text.var, "(?<=([?.!|]))(?!([?.!|]))\\s*", perl=TRUE)
# [[1]]
# [1] "I want to split here." "But also||"            "Why?"                 
# 
# [[2]]
# [1] "See!"                       "Split at end but no empty."
# 
# [[3]]
# [1] "a third string."      "It has two sentences"
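
As a quick sanity check, the split above can be compared programmatically against the desired result from the question. This is only a minimal sketch in base R; the names `pattern`, `result`, and `expected` are illustrative and not part of the original answer.

```r
# Sample input from the question
text.var <- c("I want to split here.But also||Why?",
   "See! Split at end but no empty.",
   "a third string.  It has two sentences"
)

# Same pattern as in the answer above:
#   (?<=([?.!|]))  position is preceded by a delimiter
#   (?!([?.!|]))   position is not followed by another delimiter
#   \\s*           consume any whitespace after the delimiter run
pattern <- "(?<=([?.!|]))(?!([?.!|]))\\s*"   # illustrative name
result <- strsplit(text.var, pattern, perl = TRUE)

# Desired result, copied from the question
expected <- list(
  c("I want to split here.", "But also||", "Why?"),
  c("See!", "Split at end but no empty."),
  c("a third string.", "It has two sentences")
)

# TRUE when the split reproduces the desired result exactly
print(identical(result, expected))
```

The negative look-ahead is what keeps a run such as || together: a position inside the run is followed by another delimiter, so only the position after the last delimiter qualifies as a split point. If the same pattern is passed to stri_split_regex instead, a trailing empty string may still appear; its omit_empty argument is one possible way to drop it, though that variant is not verified here.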
answered 2014-10-22T17:10:07.700