r - 使用 text2vec 循环标记化

Question

编辑以缩短并提供示例数据。

我有包含 8 个问题的文本数据，向多个参与者询问了两次。我想使用 text2vec 来比较他们在两个时间点对这些问题的回答的相似性（重复检测）。以下是我的初始数据的结构（在此示例中，只有 3 名参与者，4 个问题而不是 8 个问题，以及 2 个季度/时间段）。我想对每个参与者在第一季度与第二季度的反应进行相似性比较。我打算使用包 text2vec 的 psim 命令来执行此操作。

df<-read.table(text="ID,Quarter,Question,Answertext
               Joy,1,And another question,adsfjasljsdaf jkldfjkl
               Joy,2,And another question,dsadsj jlijsad jkldf 
               Paul,1,And another question,adsfj aslj sd afs dfj ksdf
               Paul,2,And another question,dsadsj jlijsad
               Greg,1,And another question,adsfjasljsdaf
               Greg,2,And another question, asddsf asdfasd sdfasfsdf
               Joy,1,this is the first question that was asked,this is joys answer to this question
               Joy,2,this is the first question that was asked,this is joys answer to this question
               Paul,1,this is the first question that was asked,this is Pauls answer to this question
               Paul,2,this is the first question that was asked,Pauls answer is different 
               Greg,1,this is the first question that was asked,this is Gregs answer to this question nearly the same
               Greg,2,this is the first question that was asked,this is Gregs answer to this question
               Joy,1,This is the text of another question,more random text
               Joy,2,This is the text of another question, adkjjlj;ds sdafd
               Paul,1,This is the text of another question,more random text
               Paul,2,This is the text of another question, adkjjlj;ds sdafd
               Greg,1,This is the text of another question,more random text
               Greg,2,This is the text of another question,sdaf asdfasd asdff
               Joy,1,this was asked second.,some random text
               Joy,2,this was asked second.,some random text that doesn't quite match joy's response the first time around
               Paul,1,this was asked second.,some random text
               Paul,2,this was asked second.,some random text that doesn't quite match Paul's response the first time around
               Greg,1,this was asked second.,some random text
               Greg,2,this was asked second.,ada dasdffasdf asdf  asdfa fasd sdfadsfasd fsdas asdffasd
", header=TRUE,sep=',')

我做了更多的思考，我相信正确的方法是将数据框拆分为数据框列表，而不是单独的项目。

questlist<-split(df,f=df$Question)

然后编写一个函数来为每个问题创建词汇表。

library(text2vec)

vocabmkr<-function(x) { itoken(x$AnswerText, ids=x$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 2) %>% vocab_vectorizer() }

test<-lapply(questlist, vocabmkr)

但后来我认为我需要将原始数据框拆分为问题季度组合，并将另一个列表中的词汇应用到它上面，但不知道该怎么做。

最终，我想要一个相似度分数来告诉我参与者是否重复了他们在第一季度和第二季度的部分或全部回答。

编辑：这是我从上述数据框开始的单个问题的方法。

quest1 <- filter(df,Question=="this is the first question that was asked")
quest1vocab <- itoken(as.character(quest1$Answertext), ids=quest1$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 1) %>% vocab_vectorizer()

quest1q1<-filter(quest1,Quarter==1)
quest1q1<-itoken(as.character(quest1q1$Answertext),ids=quest1q1$ID) # tokenize question1 quarter 1

quest1q2<-filter(quest1,Quarter==2) 
quest1q2<-itoken(as.character(quest1q2$Answertext),ids=quest1q2$ID) # tokenize question1 quarter 2

#now apply the vocabulary to the two matrices
quest1q1<-create_dtm(quest1q1,quest1vocab)
quest1q2<-create_dtm(quest1q2,quest1vocab)

similarity<-psim2(quest1q1, quest1q2, method="jaccard", norm="none") #row by row similarity.

b<-data.frame(ID=names(similarity),Similarity=similarity,row.names=NULL) #make dataframe of similarity scores
endproduct<-full_join(b,quest1)

编辑：好的，我已经与 lapply 合作了一些。

df1<-split.data.frame(df,df$Question) #now we have 4 dataframes in the list, 1 for each question

vocabmkr<-function(x) {
  itoken(as.character(x$Answertext), ids=x$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 1) %>% vocab_vectorizer()
}

vocab<-lapply(df1,vocabmkr) #this gets us another list and in it are the 4 vocabularies.

dfqq<-split.data.frame(df,list(df$Question,df$Quarter)) #and now we have 8 items in the list - each list is a combination of question and quarter (4 questions over 2 quarters)

如何将词汇列表（由 4 个元素组成）应用于 dfqq 列表（由 8 个元素组成）？

score 1 · Accepted Answer

对不起，这听起来令人沮丧。如果您有更多事情要做并且确实想要一种更自动的方式来完成它，这里有一种可能对您有用的方法：

首先，将单个数据帧的示例代码转换为函数：

analyze_vocab <- function(df_) {
  quest1vocab =
    itoken(as.character(df_$Answertext), ids = df_$ID) %>%
    create_vocabulary() %>%
    prune_vocabulary(term_count_min = 1) %>%
    vocab_vectorizer()

  quarter1 = filter(df_, Quarter == 1)
  quarter1 = itoken(as.character(quarter1$Answertext), 
                    ids = quarter1$ID)

  quarter2 = filter(df_, Quarter == 2)
  quarter2 = itoken(as.character(quarter2$Answertext),
                    ids = quarter2$ID)

  q1mat = create_dtm(quarter1, quest1vocab)
  q2mat = create_dtm(quarter2, quest1vocab)

  similarity = psim2(q1mat, q2mat, method = "jaccard", norm = "none")

  b = data.frame(
    ID = names(similarity),
    Similarity = similarity)

  output <- full_join(b, df_)
  return(output)
}

现在，您可以split根据需要使用，然后lapply像这样使用：lapply(split(df, df$Question), analyze_vocab). 但是，您似乎已经对管道感到满意，因此您不妨采用这种方法：

similarity_df <- df %>% 
  group_by(Question) %>%
  do(analyze_vocab(.))

输出：

> head(similarity_df, 12)
# A tibble: 12 x 5
# Groups:   Question [2]
   ID    Similarity Quarter Question                                  Answertext                                           
   <fct>      <dbl>   <int> <fct>                                     <fct>                                                
 1 Joy        0           1 And another question                      adsfjasljsdaf jkldfjkl                               
 2 Joy        0           2 And another question                      "dsadsj jlijsad jkldf "                              
 3 Paul       0           1 And another question                      adsfj aslj sd afs dfj ksdf                           
 4 Paul       0           2 And another question                      dsadsj jlijsad                                       
 5 Greg       0           1 And another question                      adsfjasljsdaf                                        
 6 Greg       0           2 And another question                      " asddsf asdfasd sdfasfsdf"                          
 7 Joy        1           1 this is the first question that was asked this is joys answer to this question                 
 8 Joy        1           2 this is the first question that was asked this is joys answer to this question                 
 9 Paul       0.429       1 this is the first question that was asked this is Pauls answer to this question                
10 Paul       0.429       2 this is the first question that was asked "Pauls answer is different "                         
11 Greg       0.667       1 this is the first question that was asked this is Gregs answer to this question nearly the same
12 Greg       0.667       2 this is the first question that was asked this is Gregs answer to this question

相似性的值与您的示例中显示的值相匹配endproduct（请注意，显示的值是四舍五入的 tibble 显示），因此它似乎按预期工作。

score 0 · Accepted Answer

我放弃并一次手动完成一个数据帧。我确信有一种简单的方法可以将其作为列表来执行，但我一生都无法弄清楚如何将函数列表（词汇矢量化器）应用于数据框列表中的“Answertext”列。

与 R 一样强大，一个允许文本交换到命令中的简单 for 循环（la Stata 的“foreach”）严重缺乏。我知道有一个不同的工作流程，涉及将数据框分解为列表并对其进行迭代，但对于某些活动，这会使事情变得非常复杂，需要复杂的索引不仅要引用列表，还要引用列表中包含的特定向量。我也认识到可以使用 assign 和 paste0 来实现类似 Stata 的行为，但这与 R 中的大多数代码一样，非常笨拙和迟钝。叹。

r - 使用 text2vec 循环标记化

2 回答 2

Related

Reference