
I have a term-document matrix (16,977 terms, 29,414 documents):

Non-/sparse entries: 355000/499006478
Sparsity           : 100%
Maximal term length: 7 
Weighting          : term frequency (tf)

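For further analysis, I want to limit the number of terms to 2,425. How can I generate a new term-document matrix that includes, for example, only the terms with a frequency above 20?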

Because the matrix is so large, the usual approach of converting it with as.matrix cannot be applied.
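
To see why as.matrix is not practical here, a rough back-of-the-envelope estimate of the dense size may help. This is only a sketch: it assumes 8 bytes per cell (as for R doubles) and ignores any overhead.

```r
n_terms <- 16977
n_docs  <- 29414

cells     <- n_terms * n_docs     # a dense matrix materialises every cell
dense_gib <- cells * 8 / 1024^3   # assumption: 8 bytes per numeric cell

cells      # equals the non-sparse plus sparse entry counts printed above
dense_gib  # roughly 3.7 GiB for a single copy
```

Only 355,000 of those cells are non-zero, which is why a sparse representation is the better fit.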


2 Answers


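Something like this might work. It uses functions from the slam package to index the DTM as a simple triplet matrix, so you never have to convert it to a dense matrix.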

library(slam)
library(tm)
data(crude)
dtm1 <- DocumentTermMatrix(crude)


# Find the total occurrences of each term across all docs
colTotals <- col_sums(dtm1)

# Keep only terms that occur more than 20 times in total
dtm2 <- dtm1[, which(colTotals > 20)]

> dtm1
A document-term matrix (20 documents, 1266 terms)

Non-/sparse entries: 2255/23065
Sparsity           : 91%
Maximal term length: 17 
Weighting          : term frequency (tf)

> dtm2
A document-term matrix (20 documents, 12 terms)

Non-/sparse entries: 174/66
Sparsity           : 28%
Maximal term length: 6 
Weighting          : term frequency (tf)

Does this work on your data and answer your question?
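
If the goal is a fixed number of terms (2,425 in the question) rather than a frequency cutoff, the same idea can be adapted to keep the top N terms by total count. A minimal sketch on a toy matrix, assuming only slam is installed; the data, `n_keep`, and the other variable names are made up for illustration:

```r
library(slam)

# Toy counts: 3 documents (rows) x 4 terms (columns), stored sparsely.
m <- simple_triplet_matrix(
  i = c(1, 1, 2, 2, 3, 3),
  j = c(1, 2, 1, 3, 1, 4),
  v = c(15, 2, 10, 30, 1, 5),
  nrow = 3, ncol = 4,
  dimnames = list(Docs  = c("d1", "d2", "d3"),
                  Terms = c("oil", "opec", "prices", "said"))
)

n_keep <- 2                      # hypothetical target; 2425 in the question
totals <- col_sums(m)            # total occurrences of each term
top    <- order(totals, decreasing = TRUE)[seq_len(min(n_keep, ncol(m)))]
m_top  <- m[, sort(top)]         # sort() preserves the original column order

dim(m_top)
dimnames(m_top)[[2]]
```

Ties at the cutoff are broken by column position here, so the exact set of terms kept can depend on column order when several terms share the same total.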

Answered 2013-06-02T09:28:46.097

I think this can be done through the control list:

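 library(tm)
 # your.corpus is a placeholder for your own tm corpus.
 # Keep terms that appear in at least 20 documents.
 dtm <- DocumentTermMatrix(your.corpus, control = list(
          bounds = list(global = c(20, Inf))
 ))
 inspect(dtm)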
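
One caveat: `bounds$global` filters on the number of documents a term appears in, while the col_sums approach filters on total occurrences. The two can give different results. A small sketch of the difference on toy data, assuming slam is installed and that the triplet matrix stores no explicit zeros; all names and values are illustrative:

```r
library(slam)

# Toy counts: 3 documents x 3 terms.
m <- simple_triplet_matrix(
  i = c(1, 1, 2, 3),
  j = c(1, 2, 2, 2),
  v = c(25, 1, 1, 1),
  nrow = 3, ncol = 3,
  dimnames = list(Docs  = c("d1", "d2", "d3"),
                  Terms = c("burst", "common", "unused"))
)

total_freq <- col_sums(m)                        # total occurrences per term
doc_freq   <- tabulate(m$j, nbins = m$ncol)      # stored entries per column = documents containing the term

total_freq
doc_freq
```

Here "burst" has a high total count but appears in a single document, whereas "common" appears in every document with a low total, so a cutoff on one measure would keep a term that the other drops.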
Answered 2020-09-24T21:21:15.760