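I am working on speech in conversational turns and want to extract words that are repeated across turns. The task I am struggling with is extracting words that are repeated inexactly.
Data: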
X <- data.frame(
speaker = c("A","B","A","B"),
speech = c("i'm gonna take a look you okay with that",
"sure looks good we can take a look you go first",
"okay last time I looked was different i think that is it yeah",
"yes you're right i think that's it"), stringsAsFactors = F
)
I have a `for` loop that successfully extracts exact repetitions:
# initialize vectors:
pattern1 <- c()
extracted1 <- c()
# run `for` loop:
library(stringr)
for(i in 2:nrow(X)){
# define each `speech` element as a pattern for the next `speech` element:
pattern1[i-1] <- paste0("\\b(", paste0(unlist(str_split(X$speech[i-1], " ")), collapse = "|"), ")\\b")
# extract all matched words:
extracted1[i] <- str_extract_all(X$speech[i], pattern1[i-1])
}
# result:
extracted1
[[1]]
NULL
[[2]]
[1] "take" "a" "look" "you"
[[3]]
character(0)
[[4]]
[1] "i" "think" "that" "it"
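However, I also want to extract inexact repetitions. For example, `looks` in row 2 is an inexact repetition of `look` in row 1, `looked` in row 3 is a fuzzy repetition of `looks` in row 2, and `yes` in row 4 is an approximate match of `yeah` in row 3. I recently came across `agrep`, which is used for approximate matching, but I don't know how to use it here or whether it is the right approach. Any help is greatly appreciated.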
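Note that the actual data contains thousands of speaking turns with highly unpredictable content, so it is not possible to define a list of all possible variants in advance.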
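To make the goal concrete, here is a minimal sketch of one possible direction. It swaps `agrep` for base R's `adist` (whole-word edit distance), because `agrep` matches approximate substrings rather than whole words. The helper name `fuzzy_repeats` and the thresholds `max_dist = 2` and `min_chars = 3` are assumptions made up for illustration, not tuned values.

```r
X <- data.frame(
  speaker = c("A", "B", "A", "B"),
  speech = c("i'm gonna take a look you okay with that",
             "sure looks good we can take a look you go first",
             "okay last time I looked was different i think that is it yeah",
             "yes you're right i think that's it"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from any package): return the words of `current`
# that are within `max_dist` edits of some word in `previous` without being
# an exact repeat of one of them.
fuzzy_repeats <- function(previous, current, max_dist = 2, min_chars = 3) {
  prev_words <- unique(strsplit(previous, " ", fixed = TRUE)[[1]])
  curr_words <- unique(strsplit(current, " ", fixed = TRUE)[[1]])
  # drop very short words, which are within a couple of edits of almost anything
  prev_words <- prev_words[nchar(prev_words) >= min_chars]
  curr_words <- curr_words[nchar(curr_words) >= min_chars]
  # exact repeats are already covered by the regex loop, so exclude them here
  curr_words <- setdiff(curr_words, prev_words)
  if (length(prev_words) == 0 || length(curr_words) == 0) return(character(0))
  # edit-distance matrix: rows = current words, columns = previous words
  d <- utils::adist(curr_words, prev_words)
  curr_words[apply(d, 1, min) <= max_dist]
}

# compare each turn with the turn before it
fuzzy1 <- vector("list", nrow(X))
for (i in 2:nrow(X)) {
  fuzzy1[[i]] <- fuzzy_repeats(X$speech[i - 1], X$speech[i])
}
fuzzy1
```

With these assumed thresholds the three target words (`looks`, `looked`, `yes`) are picked up, but so are unrelated words: `good` is only two substitutions away from `look`, so it is returned for row 2 as well. Edit distance alone therefore over-generates on this data, and a stricter or length-relative threshold would trade those false positives against misses such as `yes`/`yeah`.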