r - Trying to remove words from a DocumentTermMatrix in order to use topicmodels

Question

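I am trying to use the topicmodels package for R (100 topics on a corpus of about 6400 documents, each about 1000 words long). The process runs and then dies, I think because it runs out of memory.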

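So I tried to shrink the document-term matrix that the LDA() function takes as input; I figured I could do that with the minDocFreq option when generating the document-term matrix. But when I use it, it doesn't seem to make any difference. Here is the relevant code: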

> corpus <- Corpus(DirSource('./chunks/'),fileEncoding='utf-8')
> dtm <- DocumentTermMatrix(corpus)
> dim(dtm)
[1]  6423 41613
# So, I assume this next command will make my document term matrix smaller, i.e.
# fewer columns. I've chosen a larger number, 100, to illustrate the point.
> smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100))
> dim(smaller)
[1]  6423 41613

Same dimensions and the same number of columns (that is, the same number of terms).

Any idea what I'm doing wrong? Thanks.

score 15 · Accepted Answer

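The answer to your question is here: https://stackoverflow.com/a/13370840/1036500 (give it an upvote!)

In brief, more recent versions of the tm package do not include minDocFreq and use bounds instead. For example, your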

smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100))

should now be written with bounds, as shown here on the crude example corpus that ships with tm:

library(tm)
data("crude")  # 20-document example corpus bundled with tm

# Keep only terms that appear in at least 5 documents
smaller <- DocumentTermMatrix(crude, control = list(bounds = list(global = c(5, Inf))))
dim(smaller)
# [1] 20 67

# Keep only terms that appear in at least 10 documents
smaller <- DocumentTermMatrix(crude, control = list(bounds = list(global = c(10, Inf))))
dim(smaller)
# [1] 20 17
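
If the document-term matrix has already been built, similar trimming can probably be done after the fact instead of re-tokenizing the corpus. The following is a minimal sketch and not part of the original answer: it assumes the slam package (installed as a dependency of tm), and the cutoff of 5 and the variable names (doc_freq, keep, filtered) are illustrative only. It also drops any documents left with no terms, since LDA() in topicmodels rejects input rows that contain no non-zero entries.

```r
library(tm)
data("crude")  # example corpus bundled with tm

# Build the full matrix once.
dtm <- DocumentTermMatrix(crude)

# Document frequency of each term: the number of documents it occurs in.
# (dtm > 0) marks the non-zero cells; col_sums counts them per column.
doc_freq <- slam::col_sums(dtm > 0)

# Illustrative cutoff: keep terms found in at least 5 documents.
keep <- doc_freq >= 5
filtered <- dtm[, keep]

# Drop documents that have no remaining terms, so every row is non-empty.
filtered <- filtered[slam::row_sums(filtered) > 0, ]

dim(dtm)       # dimensions before filtering
dim(filtered)  # dimensions after filtering
```

Filtering an existing matrix this way avoids a second pass over the raw text, which may matter for a corpus of several thousand documents. Raising the cutoff shrinks the vocabulary further, at the cost of discarding rarer terms.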
