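I have a data frame that contains sentences split into fragments; in some cases a single sentence spans several rows of the data frame.

For example, head(mydataframe) returns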

#  1 Do you have any idea what
#  2  they were arguing about?
#  3          Do--Do you speak
#  4                  English?
#  5                     yeah.
#  6            No, I'm sorry.

Assume a sentence can be terminated by any of the following:

"." or "?" or "!" or "..."

Is there an R library function that can produce the following output?

#  1 Do you have any idea what they were arguing about?
#  2          Do--Do you speak English?
#  3                     yeah.
#  4            No, I'm sorry.

2 Answers


This should work for all sentences ending with ".", "...", "?" or "!".

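# Join all fragments into a single string, separated by spaces
x <- paste0(foo$txt, collapse = " ")
# Split after ., ? or ! when followed by whitespace, then trim each piece
trimws(unlist(strsplit(x, "(?<=[?.!])(?=\\s)", perl = TRUE)))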

Thanks to @AvinashRaj for the pointers on lookbehind.

This gives:

#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"                         
#[3] "yeah..."                                           
#[4] "No, I'm sorry." 

Data

I modified the toy dataset to include a case where a string ends with "..." (as requested by the OP).

foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah...", "No, I'm sorry."), 
                  stringsAsFactors = FALSE)
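
Since the question asks for a data frame rather than a character vector, the approach above could be wrapped in a small helper. This is only a sketch: the function name merge_sentences is hypothetical, and it swaps in a slightly different pattern that consumes the whitespace after the terminator, so no trimws() call is needed.

```r
# Hypothetical helper (not part of the original answer): merge text
# fragments into a data frame with one row per sentence.
merge_sentences <- function(txt) {
  # Join all fragments into one string, separated by single spaces
  x <- paste(txt, collapse = " ")
  # Split on whitespace that directly follows ., ? or !
  # (the lookbehind keeps the punctuation attached to its sentence)
  sentences <- unlist(strsplit(x, "(?<=[?.!])\\s+", perl = TRUE))
  data.frame(txt = sentences, stringsAsFactors = FALSE)
}

foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah...", "No, I'm sorry."),
                  stringsAsFactors = FALSE)

result <- merge_sentences(foo$txt)
print(result)
```

The helper assumes that every terminator is followed by whitespace or the end of the text; abbreviations such as "e.g. this" would be split too.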
answered 2015-11-15T12:34:11.837

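Here is what I got. I am sure there are better ways to do this. Here I used base functions. I created a sample data frame called foo. First, I created a single string with all the texts in txt. toString() adds commas, so I removed them with the first gsub(). Then I took care of white space (more than 2 spaces) with the second gsub(). Then I split the string at the delimiters you specified. Thanks to a post by Tyler Rinker, I managed to keep the delimiters in strsplit(). The final job was to remove the white space at the start of each sentence, and then unlist the list.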
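
To illustrate the point about keeping the delimiters, here is a minimal sketch contrasting a plain split with a lookbehind split. The sample string s is made up for the demonstration and is not part of the original data.

```r
# Made-up sample string, used only for this demonstration
s <- "Hi! Bye."

# Plain character class: the matched punctuation is consumed by the split
dropped <- strsplit(s, "[?.!]")[[1]]

# Zero-width lookbehind: the split happens after the punctuation,
# so the punctuation stays attached to the preceding piece
kept <- strsplit(s, "(?<=[?.!])", perl = TRUE)[[1]]

print(dropped)
print(kept)
```

In both cases the second piece still starts with a space, which is why a final trimming step is needed.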

EDIT: Steven Beaupré revised my code. This is the way to go!

foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah.", "No, I'm sorry."), 
                  stringsAsFactors = FALSE)

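library(magrittr)

toString(foo$txt) %>%
  # Remove every comma, including the ones toString() inserted
  gsub(pattern = ",", replacement = "", x = .) %>%
  # Split after ., ? or !; the lookbehind keeps the delimiter
  strsplit(x = ., split = "(?<=[?.!])", perl = TRUE) %>%
  # Strip the single leading space from each sentence
  lapply(., function(x) gsub(pattern = "^ ", replacement = "", x = x)) %>%
  unlist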

#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"                         
#[3] "yeah."                                             
#[4] "No I'm sorry." 
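
The last line of the output above has lost the comma in "No, I'm sorry.", because the gsub() step removes every comma and not only the ones toString() inserted. A possible base R variant, given here as a sketch rather than as the answerer's code, joins the fragments with paste() and a space so that no commas need to be removed at all.

```r
foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah.", "No, I'm sorry."),
                  stringsAsFactors = FALSE)

# paste() with a space separator inserts no commas, so the original ones survive
x <- paste(foo$txt, collapse = " ")

# Split after ., ? or !, keeping the delimiter via the lookbehind
pieces <- unlist(strsplit(x, "(?<=[?.!])", perl = TRUE))

# Strip the single leading space left on every piece after the first
sentences <- sub("^ ", "", pieces)
print(sentences)
```

This variant splits after every single terminator, so text ending in "..." would need a pattern that also requires following whitespace.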
answered 2015-11-15T12:16:53.307