Given the following example data frame:
Question <- c("Q1", "Q1", "Q1","Q1","Q2", "Q2", "Q2","Q2")
Answer <- c("I like to be creative when I cook with crock pots.","I like to be creative when I cook with crock pots.",
"I like to be creative when I cook with crock pots.","I like to be unique when I cook with a skillet.",
"I like to be creative when I cook with crock pots.","I like to be unique when I cook with a skillet.",
"I like to be unique when I cook with a skillet.","I like to be unique when I cook with a skillet.")
QAID <- c("Q11", "Q12", "Q13","Q14","Q21", "Q22", "Q23","Q24")
v <- data.frame(Question, Answer, QAID)
And the following code:
library(dplyr)
library(udpipe)
#Download your own instance of the english model to call here
udmodel_english <- udpipe_load_model(file = "english-ewt-ud-2.4-190531.udpipe")
t <- udpipe_annotate(udmodel_english, v$Answer, doc_id = paste0(v$QAID,'~',v$Question))
x <- data.frame(t)
x <- x %>%
  mutate(Question = sub(".*~", "", doc_id),
         ID = sub("~.*", "", doc_id))
stats <- keywords_rake(x = x, term = "lemma", group = "Question",
                       relevant = x$upos %in% c("NOUN", "ADJ"))
x$term <- txt_recode_ngram(x$lemma, compound = stats$keyword, ngram = stats$ngram)
x$term <- ifelse(!x$term %in% stats$keyword, NA, x$term)
x <- x %>%
  left_join(stats, by = c("term" = "keyword")) %>%
  filter(!is.na(term))
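I would like to get the following output:
I expect this output because I am trying to group the RAKE output by question, rather than have it span both questions: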
keywords_rake(x = x, term = "lemma", group = "Question",
relevant = x$upos %in% c("NOUN", "ADJ"))
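However, my output looks like this:
Even though the keyword "crock pot" is used only once in the Q2 group and three times in the Q1 group, I get the same rake score in both, with a frequency of 4.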
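Checking the documentation for the group argument of keywords_rake shows the following:
a character vector with 1 or several columns from x which indicates for example a document id or a sentence id. Keywords will be computed within this group in order not to find keywords across sentences or documents for example.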
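My question:
Am I using the group argument incorrectly? How should I use the RAKE algorithm to get a keyword's rake score within a single question, rather than across all questions? I know I can loop over the questions, but before adding that overhead I wanted to check whether there is a built-in way to handle this. Am I thinking about this function incorrectly?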
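For reference, the per-question loop I am hoping to avoid would look roughly like the sketch below. It assumes t is still the udpipe_annotate() result from the code above. The names ann, per_question and stats_by_question are made up for illustration, and using group = "doc_id" inside each question is my own choice, not something taken from the documentation.

```r
library(udpipe)

# Assumption: `t` is the udpipe_annotate() result from the code above.
# `ann` is a hypothetical name for a fresh copy of the full annotation,
# taken before the join/filter step that overwrites `x`.
ann <- as.data.frame(t)
ann$Question <- sub(".*~", "", ann$doc_id)

# Run RAKE separately on the rows of each question, so that frequencies
# and scores are computed from that question's answers only.
per_question <- lapply(split(ann, ann$Question), function(d) {
  res <- keywords_rake(x = d, term = "lemma", group = "doc_id",
                       relevant = d$upos %in% c("NOUN", "ADJ"))
  # Tag each keyword row with the question it was computed for
  res$Question <- rep(d$Question[1], nrow(res))
  res
})

# Stack the per-question results into one data frame
stats_by_question <- do.call(rbind, per_question)
```

This gives one row per keyword per question, but it calls keywords_rake once for every question, which is the overhead I would like to avoid if the function can do this by itself.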