r - Converting a document-term matrix into a matrix with a lot of data causes overflow

Question

Let's do some text mining.

Here I have a document-term matrix (from the tm package):

dtm <- TermDocumentMatrix(
     myCorpus,
     control = list(
         weighting = weightTfIdf,
         tolower=TRUE,
         removeNumbers = TRUE,
         minWordLength = 2,
         removePunctuation = TRUE,
         stopwords=stopwords("german")
      ))

When I run

typeof(dtm)

I see that it is a "list", and the structure looks like

Docs
Terms        1 2 ...
  lorem      0 0 ...
  ipsum      0 0 ...
  ...        .......

So I try

wordMatrix = as.data.frame( t(as.matrix(  dtm )) )

This works for 1000 documents.

But when I try it with 40000, it no longer works.

I get this error:

Fehler in vector(typeof(x$v), nr * nc) : Vektorgröße kann nicht NA sein
Zusätzlich: Warnmeldung:
In nr * nc : NAs durch Ganzzahlüberlauf erzeugt

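In English: Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA. In addition: Warning message: In nr * nc : NAs produced by integer overflow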

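So I looked at as.matrix, and it turns out that the function somehow converts the object to a vector with as.vector rather than directly to a matrix. The conversion to a vector works, but the conversion from vector to matrix does not.

Do you have any suggestions as to what the problem might be?

Thanks, The Captain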

score 17 · Accepted Answer

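The integer overflow tells you exactly what the problem is: with 40000 documents, you have too much data. The problem starts in the conversion to a matrix, by the way, which you can see if you look at the code of the underlying function: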

class(dtm)
[1] "TermDocumentMatrix"    "simple_triplet_matrix"

getAnywhere(as.matrix.simple_triplet_matrix)

A single object matching ‘as.matrix.simple_triplet_matrix’ was found
...
function (x, ...) 
{
    nr <- x$nrow
    nc <- x$ncol
    y <- matrix(vector(typeof(x$v), nr * nc), nr, nc)
   ...
}

This is the line the error message refers to. What is going on can easily be simulated with:

as.integer(40000 * 60000) # 40000 documents is 40000 rows in the resulting frame
[1] NA
Warning message:
NAs introduced by coercion

The function vector() takes the length as an argument, in this case nr * nc. If this is larger than approx. 2e9 (.Machine$integer.max), it is replaced by NA, and NA is not valid as a length argument for vector().
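A minimal sketch of the overflow itself, using only base R. The dimensions are the illustrative 40000 x 60000 from the simulation above, stored as integers the way x$nrow and x$ncol are:

```r
# Dimensions stored as integers, as in x$nrow and x$ncol (values are illustrative)
nr <- 40000L
nc <- 60000L

# integer * integer stays integer; the product overflows to NA (with a warning)
prod_int <- suppressWarnings(nr * nc)
is.na(prod_int)

# Converting one operand to double first keeps the exact product
prod_dbl <- as.numeric(nr) * nc
prod_dbl > .Machine$integer.max
```

The first product is what as.matrix.simple_triplet_matrix computes; the second shows that the value itself is representable once it is held as a double.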

Bottom line: you're running into the limits of R. As of now, working in 64-bit won't help you. You'll have to resort to different methods. One possibility would be to continue working with the list you have (dtm is a list), select the data you need using list manipulation, and go from there.
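As a rough sketch of what "working with the list" could look like: the toy object below is a plain list that only mimics the i, j, v, nrow, ncol and dimnames fields of a simple_triplet_matrix. Its contents, and the names stm, term_totals, keep and small, are made up for illustration. The idea is to aggregate on the triplets and build a dense matrix only for the rows you actually keep:

```r
# Hypothetical toy object mirroring the fields of a simple_triplet_matrix
stm <- list(
  i = c(1L, 1L, 2L, 3L),   # row (term) index of each non-zero entry
  j = c(1L, 3L, 2L, 3L),   # column (document) index of each non-zero entry
  v = c(2, 1, 4, 3),       # the non-zero values
  nrow = 3L, ncol = 3L,
  dimnames = list(Terms = c("lorem", "ipsum", "dolor"),
                  Docs  = c("1", "2", "3"))
)

# Per-term totals computed from the triplets, without a dense matrix
term_totals <- numeric(stm$nrow)
sums <- tapply(stm$v, stm$i, sum)
term_totals[as.integer(names(sums))] <- sums
names(term_totals) <- stm$dimnames$Terms

# Keep only terms above an (arbitrary) threshold and densify just those rows
keep <- which(term_totals > 3)
sel  <- stm$i %in% keep
small <- matrix(0, nrow = length(keep), ncol = stm$ncol,
                dimnames = list(stm$dimnames$Terms[keep], stm$dimnames$Docs))
small[cbind(match(stm$i[sel], keep), stm$j[sel])] <- stm$v[sel]
small
```

On a real dtm the same fields are available as dtm$i, dtm$j and dtm$v, so the dense part stays as small as the selection you make.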

PS: I made a dtm object with

require(tm)
data("crude")
dtm <- TermDocumentMatrix(crude,
                          control = list(weighting = weightTfIdf,
                                         stopwords = TRUE))

score 4 · Accepted Answer

Here is a very simple solution I discovered recently:

library(bigmemory)                      # provides as.big.matrix()
DTM <- t(TDM)                           # transpose the term-document matrix (optional; I prefer DTM over TDM)
M <- as.big.matrix(x = as.matrix(DTM))  # convert the DTM into a bigmemory object
M <- as.matrix(M)                       # convert the bigmemory object back to a regular matrix
M <- t(M)                               # transpose again to get the TDM back

Please note that transposing the TDM to get a DTM is entirely optional; it is just my personal preference to work with matrices this way.

P.S. I could not answer this question four years ago, as I had only just started college.

score 0 · Accepted Answer

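Based on Joris Meys's answer, I found a solution. The vector() documentation says this about the "length" argument:

... For a long vector, i.e., length > .Machine$integer.max, it has to be of type "double" ...

So we can make a tiny fix to as.matrix():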

as.big.matrix <- function(x) {
  nr <- x$nrow
  nc <- x$ncol
  # nr and nc are integers. 1 is double. Double * integer -> double
  y <- matrix(vector(typeof(x$v), 1 * nr * nc), nr, nc)
  y[cbind(x$i, x$j)] <- x$v
  dimnames(y) <- x$dimnames
  y
}
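Even with the overflow avoided, the dense result still has to fit in memory. A back-of-the-envelope check, assuming the values are doubles (8 bytes per cell) and ignoring object overhead. dense_size_gib is a hypothetical helper, not part of tm or slam, and 40000 x 60000 is the illustrative size used in the first answer:

```r
# Hypothetical helper: estimated size in GiB of a dense double matrix
# with nr rows and nc columns (8 bytes per cell, header ignored)
dense_size_gib <- function(nr, nc) {
  # convert to double first so the product cannot overflow
  as.numeric(nr) * as.numeric(nc) * 8 / 2^30
}

dense_size_gib(40000L, 60000L)  # roughly 18 GiB
```

If the estimate is well beyond the RAM you have, the fixed function will fail at allocation instead, so it is worth checking before calling it.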
