8

Suppose I have a string such as the following.

x <- 'The world is at end. What do you think?   I am going crazy!    These people are too calm.'

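I want to split only on the punctuation marks !, ? and . together with the whitespace that follows them, and keep the punctuation.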

This removes the punctuation and leaves leading spaces in the split pieces:

vec <- strsplit(x, '[!?.][:space:]*')

How can I split the sentences while keeping the punctuation?

4

5 Answers

14

You can switch on PCRE by passing perl=TRUE and then split with a lookbehind assertion.

strsplit(x, '(?<![^!?.])\\s+', perl=TRUE)

Regular expression:

(?<!          look behind to see if there is not:
 [^!?.]       any character except: '!', '?', '.'
)             end of look-behind
\s+           whitespace (\n, \r, \t, \f, and " ") (1 or more times)
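The pattern explained above can be checked directly in base R. This is a minimal sketch, assuming the same `x` as in the question; the variable name `sentences` is only chosen for illustration.

```r
x <- 'The world is at end. What do you think?   I am going crazy!    These people are too calm.'

# Split on runs of whitespace that are NOT preceded by a character other
# than '!', '?' or '.', i.e. whitespace directly after sentence punctuation.
# The whitespace is consumed by the split; the punctuation is not.
sentences <- unlist(strsplit(x, '(?<![^!?.])\\s+', perl = TRUE))

print(sentences)
```

Because the lookbehind is zero-width, each piece keeps its closing punctuation, and the `\s+` swallows however many blanks follow it.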


answered 2013-11-01T03:12:43.793
6

The sentSplit function in the qdap package was created specifically for this task:

library(qdap)
sentSplit(data.frame(text = x), "text")

##   tot                       text
## 1 1.1       The world is at end.
## 2 2.2         What do you think?
## 3 3.3          I am going crazy!
## 4 4.4 These people are too calm.
answered 2013-11-01T03:21:11.603
2

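Take a look at this question. Character classes such as [:space:] are only defined inside a bracket expression, so you need to enclose the class in a second set of brackets. Try: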

vec <- strsplit(x, '[!?.][[:space:]]*')
vec
# [[1]]
# [1] "The world is at end"       "What do you think"        
# [3] "I am going crazy"          "These people are too calm"

This gets rid of the leading spaces. To keep the punctuation, use a positive lookbehind assertion together with perl = TRUE:

vec <- strsplit(x, '(?<=[!?.])[[:space:]]*', perl = TRUE)
vec
# [[1]]
# [1] "The world is at end."       "What do you think?"        
# [3] "I am going crazy!"          "These people are too calm."
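The difference between the two spellings of the class is easy to see on a small input. This is a minimal sketch, assuming R's default regex engine (not perl = TRUE); the sample strings and variable names are made up for illustration.

```r
# '[[:space:]]' is a bracket expression containing the POSIX class [:space:],
# so it matches spaces, tabs and other whitespace.
collapsed <- gsub("[[:space:]]+", "_", "a \t b")
print(collapsed)

# '[:space:]' on its own is read as a bracket expression listing the literal
# characters ':', 's', 'p', 'a', 'c' and 'e', so a blank does not match it,
# while the letter "s" does.
matches_blank  <- grepl("[:space:]", " ")
matches_letter <- grepl("[:space:]", "s")
print(c(matches_blank, matches_letter))
```

This is why the pattern in the question left the spaces in place: after the punctuation it looked for those six literal characters rather than for whitespace.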
answered 2013-11-01T03:46:39.980
1

You can replace the whitespace that follows the punctuation with a marker string, e.g. zzzzz, and then split on that string.

x <- gsub("([!?.])[[:space:]]*","\\1zzzzz","The world is at end. What do you think?   I am going crazy!    These people are too calm.")
strsplit(x, "zzzzz")

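In the replacement string, \1 (written "\\1" in R source) refers to the text matched by the parenthesized subexpression of the pattern, here the punctuation mark.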
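A marker such as zzzzz could in principle occur in real text. One possible variation, sketched here under the assumption that the input never contains the ASCII unit separator character, uses that control character as the marker and splits on it literally with fixed = TRUE. The names `marker`, `marked` and `pieces` are illustrative only.

```r
x <- 'The world is at end. What do you think?   I am going crazy!    These people are too calm.'

# Assumption: the unit separator (0x1F) does not appear in the input text.
marker <- "\x1F"

# Keep the punctuation via the backreference \\1 and replace the
# whitespace after it (possibly none) with the marker.
marked <- gsub("([!?.])[[:space:]]*", paste0("\\1", marker), x)

# Split on the marker as a literal string rather than a regex.
pieces <- unlist(strsplit(marked, marker, fixed = TRUE))
print(pieces)
```

The marker added after the final period does not produce a trailing empty element, because strsplit drops a match at the end of the string.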

answered 2013-11-01T03:59:22.683
1

As of qdap version 1.1.0, you can use the sent_detect function as follows:

library(qdap)
sent_detect(x)

## [1] "The world is at end."       "What do you think?"        
## [3] "I am going crazy!"          "These people are too calm."
answered 2014-02-26T23:18:14.870