I have a data frame with speech data in the column speech:
df <- data.frame(
  id = c("A", "A", "B", NA, "A", "B", "B", "B", "B"),
  speech = c("hi", "how are you [Larry]?", "[uh]", "(0.123)",
             "I'm fine [you 'n Mary] how's it [goin]?", "[erm]", "(0.4)", "well", "y'know what it's like")
)
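I need to filter for rows (i) that belong to a run of the same id AND (ii) whose run starts with a speech value consisting of something in square brackets [...]. However, the filter should not consume all of the speech by the same id: it should stop as soon as it hits content in round brackets (...) or plain text.
I can filter df on conditions (i) and (ii):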
library(dplyr)
library(data.table)
df %>%
  group_by(grp = rleid(id)) %>%
  filter(grepl("^\\[.*?\\]$", first(speech)))
This gives:
# A tibble: 5 x 3
# Groups:   grp [2]
  id    speech                  grp
  <chr> <chr>                 <int>
1 B     [uh]                      2
2 B     [erm]                     5
3 B     (0.4)                     5
4 B     well                      5
5 B     y'know what it's like     5
But I don't know how to make the filter stop at a speech value with (...) or with plain text. The expected result is this:
# A tibble: 2 x 3
# Groups:   grp [2]
  id    speech   grp
  <chr> <chr>  <int>
1 B     [uh]       2
2 B     [erm]      5
Any help is much appreciated!
EDIT:
It looks like I found a solution myself:
df %>%
  group_by(grp = rleid(id)) %>%
  filter(grepl("^\\[.*?\\]$", first(speech)) & !grepl("^\\(\\d\\.\\d{1}\\)$|^\\w", speech))
But thanks anyway to everyone who gave it some thought!
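One caveat on the filter in the edit: it drops non-matching rows one at a time, so it does not literally stop at the first row in round brackets or plain text. A minimal sketch of a stopping variant, assuming dplyr's cumall() and reusing the same bracket pattern, might look like this (the name result is only illustrative):

```r
library(dplyr)
library(data.table)

# Same example data as in the question
df <- data.frame(
  id = c("A", "A", "B", NA, "A", "B", "B", "B", "B"),
  speech = c("hi", "how are you [Larry]?", "[uh]", "(0.123)",
             "I'm fine [you 'n Mary] how's it [goin]?", "[erm]", "(0.4)", "well", "y'know what it's like")
)

result <- df %>%
  # one group per consecutive run of the same id
  group_by(grp = rleid(id)) %>%
  # cumall() stays TRUE only while every row so far in the run is fully
  # bracketed, so a run is kept up to (not including) its first other row
  filter(cumall(grepl("^\\[.*?\\]$", speech))) %>%
  ungroup()

print(result)
```

On the example data this keeps the rows [uh] and [erm]. Because cumall() turns FALSE for the rest of a run once one row fails, a bracketed row that comes after plain text in the same run would be dropped as well, which a row-by-row negation would not do.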