Suppose I have a string such as the following.
x <- 'The world is at end. What do you think? I am going crazy! These people are too calm.'
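I need to split only on the punctuation marks !, ? and . together with any whitespace that follows, and keep the punctuation in the result.
The following removes the punctuation and leaves a leading space in the split pieces: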
vec <- strsplit(x, '[!?.][:space:]*')
How can I split the sentences while keeping the punctuation?
The sentSplit function in the qdap package was created specifically for this task:
library(qdap)
sentSplit(data.frame(text = x), "text")
## tot text
## 1 1.1 The world is at end.
## 2 2.2 What do you think?
## 3 3.3 I am going crazy!
## 4 4.4 These people are too calm.
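Character classes such as [:space:] are only defined inside a bracket expression, so you need to enclose the class in a second set of brackets. Written as [:space:] alone, it is read as an ordinary bracket expression matching the characters :, s, p, a, c and e. Try: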
vec <- strsplit(x, '[!?.][[:space:]]*')
vec
# [[1]]
# [1] "The world is at end" "What do you think"
# [3] "I am going crazy" "These people are too calm"
This gets rid of the leading spaces. To keep the punctuation as well, use a positive lookbehind assertion together with perl = TRUE:
vec <- strsplit(x, '(?<=[!?.])[[:space:]]*', perl = TRUE)
vec
# [[1]]
# [1] "The world is at end." "What do you think?"
# [3] "I am going crazy!" "These people are too calm."
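For reuse, the lookbehind split above could be wrapped in a small helper. This is only a minimal sketch: the name split_sentences is hypothetical (it is not part of base R or qdap), and it assumes a single input string.

```r
# Hypothetical helper (not part of base R or qdap): split after ., ! or ?
# and drop any whitespace that follows, keeping the punctuation.
split_sentences <- function(text) {
  # (?<=[!?.]) is a lookbehind, so the punctuation itself is not consumed.
  # Lookbehind needs the PCRE engine, hence perl = TRUE.
  strsplit(text, "(?<=[!?.])[[:space:]]*", perl = TRUE)[[1]]
}

x <- "The world is at end. What do you think? I am going crazy! These people are too calm."
sentences <- split_sentences(x)
print(sentences)
```

Note that this splits after every ., ! or ?, so abbreviations such as "e.g." would also be treated as sentence ends.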
You can replace the whitespace that follows the punctuation with a placeholder string, for example zzzzz, and then split on that string.
x <- gsub("([!?.])[[:space:]]*","\\1zzzzz","The world is at end. What do you think? I am going crazy! These people are too calm.")
strsplit(x, "zzzzz")
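Here \1 in the replacement string (written "\\1" in R source) refers back to the parenthesized subexpression of the pattern, that is, the punctuation mark that was matched.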
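The placeholder approach fails silently if the placeholder already occurs in the text. A hedged sketch of a slightly more defensive variant follows; the function name split_with_placeholder and its placeholder argument are assumptions made for illustration, not part of any package.

```r
# Hypothetical helper: mark sentence ends with a placeholder, then split on it.
split_with_placeholder <- function(text, placeholder = "zzzzz") {
  # Guard: the approach breaks if the placeholder is already in the text.
  stopifnot(!grepl(placeholder, text, fixed = TRUE))
  # Keep the punctuation (\\1) and replace trailing whitespace with the marker.
  marked <- gsub("([!?.])[[:space:]]*", paste0("\\1", placeholder), text)
  # fixed = TRUE treats the placeholder as a literal string, not a regex.
  strsplit(marked, placeholder, fixed = TRUE)[[1]]
}

x <- "The world is at end. What do you think? I am going crazy! These people are too calm."
parts <- split_with_placeholder(x)
print(parts)
```

The marker added after the final sentence does not produce an empty trailing element, because strsplit drops a match at the very end of the string.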
As of qdap version 1.1.0, you can use the sent_detect function as follows:
library(qdap)
sent_detect(x)
## [1] "The world is at end." "What do you think?"
## [3] "I am going crazy!" "These people are too calm."