Given the following example data frame:

Question <- c("Q1", "Q1", "Q1","Q1","Q2", "Q2", "Q2","Q2")
Answer <- c("I like to be creative when I cook with crock pots.","I like to be creative when I cook with crock pots.",
            "I like to be creative when I cook with crock pots.","I like to be unique when I cook with a skillet.",
            "I like to be creative when I cook with crock pots.","I like to be unique when I cook with a skillet.",
            "I like to be unique when I cook with a skillet.","I like to be unique when I cook with a skillet.")
QAID <- c("Q11", "Q12", "Q13","Q14","Q21", "Q22", "Q23","Q24")

v <- data.frame(Question, Answer, QAID)

And the following code:

library(dplyr)
library(udpipe)

# Download your own copy of the English model and point to it here
udmodel_english <- udpipe_load_model(file = "english-ewt-ud-2.4-190531.udpipe")

t <- udpipe_annotate(udmodel_english, v$Answer, doc_id = paste0(v$QAID,'~',v$Question))
x <- data.frame(t)

x <- x %>%
  mutate(Question = sub(".*~", "", doc_id),
         ID = sub("~.*", "", doc_id))

stats <- keywords_rake(x = x, term = "lemma", group = "Question", 
                       relevant = x$upos %in% c("NOUN", "ADJ"))

x$term <- txt_recode_ngram(x$lemma, compound = stats$keyword, ngram = stats$ngram)
x$term <- ifelse(!x$term %in% stats$keyword, NA, x$term)

x <- x %>%
  left_join(stats, by = c("term" = "keyword")) %>%
  filter(!is.na(term))

I would like to get the following output:

[image: expected output]

I expect this output because I am trying to group the RAKE output by question, rather than across both questions:

keywords_rake(x = x, term = "lemma", group = "Question", 
                       relevant = x$upos %in% c("NOUN", "ADJ"))

However, my output looks like this:

[image: actual output]

Even though the keyword "crock pot" is used only once in group Q2 and 3 times in group Q1, I get the same RAKE score and a frequency of 4 for both groups.

The documentation for the group argument of keywords_rake says the following:

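a character vector with 1 or several columns from x which indicates for example a document id or a sentence id. Keywords will be computed within this group in order not to find keywords across sentences or documents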

My question:

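Am I using the group argument incorrectly? How should I use the RAKE algorithm to get a keyword's RAKE score within a single question, rather than across all questions? I know I could loop over the questions, but before adding that overhead I wanted to check whether there is a built-in way to handle this. Am I thinking about this function the wrong way?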
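
For reference, the loop I have in mind would look roughly like the sketch below. It is only a sketch: rake_by_question is a hypothetical helper name of my own, and it assumes x is the annotated data frame right after the Question column is added (before term is recoded). I set n_min = 1 on the assumption that the default of 2 would otherwise drop a keyword that appears only once within a question.

```r
library(udpipe)

# Hypothetical helper: run RAKE separately on each question's tokens.
# Assumes x has the columns doc_id, Question, lemma and upos.
rake_by_question <- function(x, n_min = 1) {
  per_question <- lapply(split(x, x$Question), function(d) {
    # Group by doc_id so a keyword never spans two answers
    out <- keywords_rake(x = d, term = "lemma", group = "doc_id",
                         relevant = d$upos %in% c("NOUN", "ADJ"),
                         n_min = n_min)
    # Tag every keyword row with the question it was computed for
    out$Question <- rep(d$Question[1], nrow(out))
    out
  })
  res <- do.call(rbind, per_question)
  rownames(res) <- NULL
  res
}
```

Calling stats <- rake_by_question(x) should then give one row per keyword per question, and the later join would need to match on both term and Question instead of term alone.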
