r - Remove empty documents from DocumentTermMatrix in R topicmodels?

Question

I am doing topic modelling using the topicmodels package in R. I am creating a Corpus object, doing some basic preprocessing, and then creating a DocumentTermMatrix:

corpus <- Corpus(VectorSource(vec), readerControl=list(language="en")) 
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
...snip removing several custom lists of stopwords...
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus, control=list(minDocFreq=2, minWordLength=2))

Then I perform LDA:

LDA(dtm, 30)

The final call to LDA() returns the error

  "Each row of the input matrix needs to contain at least one non-zero entry".

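I assume this means that at least one document has no terms left after preprocessing. Is there an easy way to remove documents that contain no terms from a DocumentTermMatrix?

I looked in the documentation for the topicmodels package and found the function removeSparseTerms, which removes terms that do not appear in any document, but there is no analogue for removing documents.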

score 62 · Accepted Answer

"Each row of the input matrix needs to contain at least one non-zero entry"

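This error means that the sparse matrix contains a row without entries (words). One idea is to compute the sum of words for each row and keep only the rows whose sum is positive: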

rowTotals <- apply(dtm , 1, sum) #Find the sum of words in each Document
dtm.new   <- dtm[rowTotals> 0, ]           #remove all docs without words
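To see the effect end to end, here is a minimal sketch on a toy corpus. The three example strings are made up for illustration; the second one is empty, so it produces an all-zero row.

```r
library(tm)

# Toy corpus (hypothetical data): the second document is empty,
# so its row in the document-term matrix has no entries.
docs   <- c("topic models need words", "", "words words words")
corpus <- VCorpus(VectorSource(docs))
dtm    <- DocumentTermMatrix(corpus)

# Same approach as above: sum the counts in each row, keep positive rows.
rowTotals <- apply(dtm, 1, sum)
dtm.new   <- dtm[rowTotals > 0, ]

nDocs(dtm)      # number of documents before filtering
nDocs(dtm.new)  # number of documents after filtering
```

The filtered matrix can then be passed to LDA() without triggering the error.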

score 28 · Answer

agstudy's answer works well, but it proved somewhat slow on a slower computer.

library(tictoc)  # provides tic() and toc()
tic()
row_total = apply(dtm, 1, sum)
dtm.new = dtm[row_total>0,]
toc()
4.859 sec elapsed

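(This was done with a 4000x15000 dtm.)

The bottleneck appears to be applying sum() to a sparse matrix.

A document-term matrix created by the tm package contains the components i and j, which hold the row and column indices of the entries stored in the sparse matrix. If dtm$i does not contain a particular row index p, then row p is empty.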

tic()
ui = unique(dtm$i)
dtm.new = dtm[ui,]
toc()
0.121 sec elapsed

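ui contains the indices of all non-empty rows, and since dtm$i is already ordered, dtm.new will be in the same order as dtm. The performance gain may not matter for smaller document-term matrices, but it can become significant for larger ones.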
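As a hedged sketch of two variations on this idea: wrapping the indices in sort() removes the reliance on dtm$i being ordered, and slam::row_sums() (from the slam package that tm builds on) computes row totals without converting the matrix to dense form. The toy documents are made up for illustration.

```r
library(tm)

# Toy corpus (hypothetical data) with two empty documents.
docs <- c("sparse matrices store triplets", "",
          "empty rows have no triplets", "")
dtm  <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Variant 1: row indices present in the triplet representation.
# sort() guards against dtm$i not being in ascending order.
ui <- sort(unique(dtm$i))
dtm_by_index <- dtm[ui, ]

# Variant 2: sparse-aware row sums, then the usual logical filter.
dtm_by_sums <- dtm[slam::row_sums(dtm) > 0, ]
```

Both variants should keep the same documents; the second reads closer to the accepted answer while avoiding apply() on the full matrix.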

score 12 · Answer

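This just elaborates on the answer given by agstudy.

Instead of removing the empty rows from the dtm, we can identify the documents in our corpus that have zero length, remove them directly from the corpus, and then build a second dtm that contains only non-empty documents.

This is useful for keeping a 1:1 correspondence between the dtm and the corpus.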

# rowTotals is the vector of row sums from agstudy's answer
empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
corpus <- corpus[-as.numeric(empty.rows)]
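A minimal sketch of the same idea using a logical mask instead of negative indices. This is an alternative formulation, not the answer's own code: the mask works by position, and it still behaves sensibly when no document is empty (a negative index built from an empty vector would select nothing). The toy documents are made up for illustration.

```r
library(tm)

# Toy corpus (hypothetical data): the second document is empty.
docs   <- c("alpha beta", "", "gamma delta")
corpus <- VCorpus(VectorSource(docs))
dtm    <- DocumentTermMatrix(corpus)

# TRUE for every document that has at least one term.
keep <- unname(slam::row_sums(dtm) > 0)

# Drop empty documents from the corpus, then rebuild the matrix
# so that corpus and dtm stay in 1:1 correspondence.
corpus_new <- corpus[keep]
dtm_new    <- DocumentTermMatrix(corpus_new)
```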

score 2 · Answer

Just remove the sparse terms from the DTM and everything will work fine.

dtm <- DocumentTermMatrix(crude, sparse=TRUE)

score 0 · Answer

Just a small addendum to Dario Lacan's answer:

empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]

will collect the documents' ids, not their positions in the corpus. Try this:

library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude)
dtm[1, ]$dimnames[1][[1]] # return "127", not "1"

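If you build your own corpus with sequential numbering, some documents may be removed during data cleaning, and the numbering will then no longer match the positions. So it is better to use the id directly: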

corpus <- tm_filter(
  corpus,
  FUN = function(doc) !is.element(meta(doc)$id, empty.rows)
  # equivalently: !(meta(doc)$id %in% empty.rows)
)
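The following sketch illustrates why matching on ids is safer than matching on positions. It simulates an earlier cleaning step that has already dropped one document, so ids and positions no longer coincide. The toy documents and the name empty_ids are made up for illustration.

```r
library(tm)

# Toy corpus (hypothetical data) with two empty documents.
corpus <- VCorpus(VectorSource(c("alpha beta", "", "gamma delta", "")))

# Simulate earlier cleaning: the first document is already gone,
# so the remaining documents keep their original ids but shift position.
corpus <- corpus[-1]
dtm    <- DocumentTermMatrix(corpus)

# Ids (not positions) of the documents whose rows are empty.
empty_ids <- Docs(dtm)[slam::row_sums(dtm) == 0]

# Keep only documents whose id is not among the empty ones.
corpus_new <- tm_filter(
  corpus,
  FUN = function(doc) !(meta(doc, "id") %in% empty_ids)
)
```

Treating empty_ids as positions here would remove the wrong document, which is the pitfall described above.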

score 0 · Answer

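I have a column lt$title in a data frame that contains strings. There are no "empty" rows in this column, but I still got the error:

Error in LDA(dtm, k = 20, control = list(seed = 813)) : Each row of the input matrix needs to contain at least one non-zero entry

Some of the solutions above did not work for me, because I need to join the vector of predicted topics back to my original data frame. Hence, deleting the rows without entries from the document-term matrix was not an option.

The problem is that some of the (very short) strings in lt$title contain special characters that cannot be handled by Corpus() and/or DocumentTermMatrix().

My solution is to remove the "short" strings (at most one or two words), which do not carry much information anyway.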

# Clean up text data
lt$test=nchar(lt$title)
lt = lt[!lt$test<10,]
lt$test<-NULL

# Topic modeling
corpus <- Corpus(VectorSource(lt$title))
dtm = DocumentTermMatrix(corpus)
tm = LDA(dtm, k = 20, control = list(seed = 813))

# Add "topics" to original DF
lt$topic = topics(tm)
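If dropping rows from the data frame is not desirable, one alternative (a sketch, not the answer's method) is to fit the model only on the non-empty rows and write the topics back through a logical mask, leaving NA for rows that had no terms. The data frame contents below are made up for illustration; the name lt mirrors the answer.

```r
library(tm)
library(topicmodels)

# Hypothetical stand-in for the lt data frame; the second title
# yields no terms once the matrix is built.
lt <- data.frame(
  title = c("stock markets fall sharply", "!!", "oil prices rise again",
            "markets react to oil prices", "stock prices fall again"),
  stringsAsFactors = FALSE
)

dtm  <- DocumentTermMatrix(VCorpus(VectorSource(lt$title)))
keep <- unname(slam::row_sums(dtm) > 0)  # rows that contain terms

# Fit only on non-empty rows (k = 2 chosen for the toy data).
model <- LDA(dtm[keep, ], k = 2, control = list(seed = 813))

# Write topics back by mask; rows without terms stay NA.
lt$topic <- NA_integer_
lt$topic[keep] <- topics(model)
```

This keeps the data frame at its original length, so the topic column lines up with every other column.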
