I found a way to use bigrams instead of single tokens in a term-document matrix. The solution was posted here on Stack Overflow: findAssocs for multiple terms in R
The idea goes something like this:
library(tm)
library(RWeka)
data(crude)
# Tokenizer for n-grams (here bigrams: min = max = 2), passed on to the term-document matrix constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
However, the last line gives me this error:
Error in rep(seq_along(x), sapply(tflist, length)) :
invalid 'times' argument
In addition: Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
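If I remove the tokenizer from the last line, it creates a regular tdm, so I guess the problem is somewhere in the BigramTokenizer function, even though this is the same example that the Weka site gives here: http://tm.r-forge.r-project.org/faq.html#Bigrams.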
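
One way to narrow this down is to swap in a tokenizer that does not go through RWeka (and therefore Java) at all. The following is only a minimal sketch under that assumption: BaseBigramTokenizer and txtTdmBase are hypothetical names introduced here, not part of tm or RWeka, and the whitespace splitting is deliberately naive. If this version builds a matrix while the RWeka version still fails, that would suggest the problem lies in the RWeka call rather than in the TermDocumentMatrix constructor.

```r
library(tm)
data(crude)

# Hypothetical stand-in tokenizer: builds bigrams in base R, without RWeka/Java.
BaseBigramTokenizer <- function(x) {
  # Split the document text on whitespace and drop empty strings
  w <- unlist(strsplit(as.character(x), "[[:space:]]+"))
  w <- w[nzchar(w)]
  # A document with fewer than two words has no bigrams
  if (length(w) < 2) return(character(0))
  # Pair each word with the word that follows it
  paste(head(w, -1), tail(w, -1))
}

# Step 1: check that the tokenizer works on a single document by itself
head(BaseBigramTokenizer(crude[[1]]))

# Step 2: check whether the constructor accepts a custom tokenizer at all
txtTdmBase <- TermDocumentMatrix(crude, control = list(tokenize = BaseBigramTokenizer))
```

Running the original BigramTokenizer on as.character(crude[[1]]) in the same way as step 1 would show whether it returns a character vector of bigrams or fails before the matrix is even built.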