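Given three TermDocumentMatrix objects built from text1, text2 and text3, I'd like to calculate the word frequency for each of them into a data frame and then rbind all the data frames. The three are just a sample; I actually have hundreds, so I need to turn this into a function.
Calculating the word frequency for a single TDM is easy: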
apply(x, 1, sum)
or
rowSums(as.matrix(x))
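Since there are hundreds of TDMs, it may be worth noting that as.matrix() builds a dense matrix first. A minimal sketch of a sparse alternative, assuming the slam package (which tm depends on) is available; the tiny TDM `x` here is hypothetical and only for illustration:

```r
library(tm)

# Hypothetical one-off TDM, only for illustration
x <- TermDocumentMatrix(Corpus(VectorSource(c("apple cool", "cool coke"))))

# Dense route from the question: convert to a regular matrix, then sum each row
dense_freq <- rowSums(as.matrix(x))

# Sparse route: slam::row_sums() sums the rows without building a dense matrix
sparse_freq <- slam::row_sums(x)
```

Both should give one count per term; the sparse route mainly matters when the matrices are large.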
I'd like to collect the TDMs in a list:
tdm_list <- Filter(function(x) is(x, "TermDocumentMatrix"), mget(ls()))
then calculate the frequency of each word and put it into a data frame:
data.frame(lapply(tdm_list, sum)) # Wrong: this sums the frequencies of all words instead of giving a frequency per word.
and then rbind them all:
do.call(rbind, df_list)
I can't figure out how to use lapply over the TDMs to calculate the word frequencies.
Here is some sample data to play with:
require(tm)
text1 <- c("apple" , "love", "crazy", "peaches", "cool", "coke", "batman", "joker")
text2 <- c("omg", "#rstats" , "crazy", "cool", "bananas", "functions", "apple")
text3 <- c("Playing", "rstats", "football", "data", "coke", "caffeine", "peaches", "cool")
tdm1 <- TermDocumentMatrix(Corpus(VectorSource(text1)))
tdm2 <- TermDocumentMatrix(Corpus(VectorSource(text2)))
tdm3 <- TermDocumentMatrix(Corpus(VectorSource(text3)))
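One possible way to get a per-word frequency for every TDM and stack the results is sketched below. It is only an illustration under a few assumptions: the helper `term_freq_df` and the column names `source`, `term` and `freq` are made up for this example, and the list is built by hand rather than with Filter()/mget() so the block runs on its own.

```r
library(tm)

# Sample data from the question
text1 <- c("apple", "love", "crazy", "peaches", "cool", "coke", "batman", "joker")
text2 <- c("omg", "#rstats", "crazy", "cool", "bananas", "functions", "apple")
text3 <- c("Playing", "rstats", "football", "data", "coke", "caffeine", "peaches", "cool")
tdm1 <- TermDocumentMatrix(Corpus(VectorSource(text1)))
tdm2 <- TermDocumentMatrix(Corpus(VectorSource(text2)))
tdm3 <- TermDocumentMatrix(Corpus(VectorSource(text3)))

# Named list of TDMs; the names are reused to label the rows later
tdm_list <- list(tdm1 = tdm1, tdm2 = tdm2, tdm3 = tdm3)

# Hypothetical helper: turn one TDM into a data frame with one row per term
term_freq_df <- function(tdm, source) {
  freq <- rowSums(as.matrix(tdm))  # per-term totals across documents
  data.frame(source = source,
             term = names(freq),
             freq = unname(freq),
             stringsAsFactors = FALSE)
}

# Apply the helper to every TDM together with its name, then stack the results
df_list <- Map(term_freq_df, tdm_list, names(tdm_list))
all_freq <- do.call(rbind, df_list)
rownames(all_freq) <- NULL
```

Map() is used instead of lapply() here only so that each data frame can carry the name of the TDM it came from; lapply(tdm_list, function(x) rowSums(as.matrix(x))) would give the per-word counts as a list of named vectors.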