r - Split strings into multiple rows at capital letters using cSplit

Question

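I have survey data. Some questions allow multiple answers, and in my data the different answers are separated by commas. I want to add a new row to the data frame for each choice. So I have something like this: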

survey$q1 <- c("I like this", "I like that", "I like this, but not much",
 "I like that, but not much", "I like this,I like that", 
"I like this, but not much,I like that")

If commas were used only to separate the multiple choices, I would use:

survey <- cSplit(survey, "q1", ",", direction = "long")

and get the desired result. Since some commas are part of an answer, I tried using a comma followed by a capital letter as the delimiter:

survey <- cSplit(survey, "q1", ",(?=[A-Z])", direction = "long")

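But for some reason it doesn't work. It doesn't give any error, but it doesn't split the strings and it also removes some rows from the data frame. Then I tried using strsplit: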

strsplit(survey$q1, ",(?=[A-Z])", perl = TRUE)

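It splits the strings correctly, but I can't work out how to make each sentence a separate row of the same column, the way cSplit does. The desired output is: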

survey$q1
[1] "I like this"
[2] "I like that"
[3] "I like this, but not much"
[4] "I like that, but not much"
[5] "I like this"
[6] "I like that"
[7] "I like this, but not much"
[8] "I like that"

Is there a way to get this using one of these two methods? Thanks.

score 2 · Accepted Answer

One option is separate_rows:

library(dplyr)
library(tidyr)
survey %>% 
   separate_rows(q1, sep=",(?=[A-Z])")
#                       q1
#1               I like this
#2               I like that
#3 I like this, but not much
#4 I like that, but not much
#5               I like this
#6               I like that
#7 I like this, but not much
#8               I like that

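cSplit has an argument fixed, which is TRUE by default, so the separator is matched literally. If we use fixed = FALSE, it can fail, possibly because it is not optimized for PCRE regex expressions: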

library(splitstackshape)
cSplit(survey, "q1", ",(?=[A-Z])", direction = "long", fixed = FALSE)

Error in strsplit(indt[[splitCols[x]]], split = sep[x], fixed = fixed) : 
  invalid regular expression ',(?=[A-Z])', reason 'Invalid regexp'
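
The call shown in the error has no perl = TRUE, which suggests the pattern reaches R's default regex engine, and that engine does not understand the lookahead (?=...). A minimal base-R sketch of the difference, assuming a made-up one-element vector x (not part of the original data):

```r
# Hypothetical input, for illustration only
x <- "I like this, but not much,I like that"

# Default regex engine: the lookahead is rejected, so strsplit() errors.
# tryCatch() captures the error instead of stopping the script.
default_engine <- tryCatch(
  strsplit(x, ",(?=[A-Z])"),
  error = function(e) e
)
inherits(default_engine, "error")

# PCRE engine (perl = TRUE): the lookahead is understood and the split works
pcre_engine <- strsplit(x, ",(?=[A-Z])", perl = TRUE)
pcre_engine[[1]]
```

This is only a guess at what happens inside cSplit, based on the call printed in the error message.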

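One way around this is to first modify the column with a function (sub/gsub) that accepts PCRE regex expressions, replacing the delimiter with a different sep, and then use cSplit with that sep: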

cSplit(transform(survey, q1 = sub(",(?=[A-Z])", ":", q1, perl = TRUE)), 
         "q1", sep=":", direction = "long")
#                        q1
#1:               I like this
#2:               I like that
#3: I like this, but not much
#4: I like that, but not much
#5:               I like this
#6:               I like that
#7: I like this, but not much
#8:               I like that
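
Since the question also asks about strsplit, here is a base-R sketch of the same reshaping without extra packages. It is one possible approach rather than what cSplit does internally; the names parts and long are made up for illustration.

```r
# Same data as in the "Data" section of this answer
survey <- data.frame(
  q1 = c("I like this", "I like that", "I like this, but not much",
         "I like that, but not much", "I like this,I like that",
         "I like this, but not much,I like that"),
  stringsAsFactors = FALSE
)

# Split each answer on a comma that is followed by a capital letter
parts <- strsplit(survey$q1, ",(?=[A-Z])", perl = TRUE)

# Repeat each original row once per piece, then overwrite q1 with the pieces
long <- survey[rep(seq_len(nrow(survey)), lengths(parts)), , drop = FALSE]
long$q1 <- unlist(parts)
rownames(long) <- NULL
long
```

Repeating whole rows with rep() keeps any other survey columns aligned with the split answers.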

Data

survey <- structure(list(q1 = c("I like this", "I like that", "I like this, but not much", 
"I like that, but not much", "I like this,I like that", "I like this, but not much,I like that"
)), class = "data.frame", row.names = c(NA, -6L))

score 1 · Accepted Answer

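@akrun's answer is correct. I just want to add that if you need to split some strings into more than 2 parts, his code does the job if you run the same line multiple times. I'm not entirely sure why that is, but it works.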
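
A likely explanation is that sub() replaces only the first match in each string, while gsub() replaces all of them. A small base-R sketch, assuming a made-up answer with three choices:

```r
# Hypothetical answer with three choices, to show the difference
x <- "I like this,I like that,I like both"

# sub() replaces only the first comma that is followed by a capital letter
first_only <- sub(",(?=[A-Z])", ":", x, perl = TRUE)
first_only

# gsub() replaces every such comma in one pass
all_matches <- gsub(",(?=[A-Z])", ":", x, perl = TRUE)
all_matches
```

If that is the cause, swapping sub for gsub in the cSplit(transform(...)) call should split every part in a single run.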
