I am trying to build n-grams from a large text corpus (about 1 GB object size in R) using the excellent quanteda package. I don't have cloud resources available, so I'm doing the computation on my own laptop (Windows and/or Mac, 12 GB RAM).
If I sample the data into pieces, the code works and I get a (partial) n-gram dfm of varying size, but when I try to run it on the whole corpus I unfortunately hit memory limits at this corpus size and get the following error (example code for unigrams, single words):
> dfm(corpus, verbose = TRUE, stem = TRUE,
ignoredFeatures = stopwords("english"),
removePunct = TRUE, removeNumbers = TRUE)
Creating a dfm from a corpus ...
... lowercasing
... tokenizing
... indexing documents: 4,269,678 documents
... indexing features:
Error: cannot allocate vector of size 1024.0 Mb
In addition: Warning messages:
1: In unique.default(allFeatures) :
Reached total allocation of 11984Mb: see help(memory.size)
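For reference, the piecewise approach that does run looks roughly like the sketch below. The chunk count is arbitrary, and both texts() for pulling the raw documents and rbind() for recombining the partial dfm objects are assumptions about this quanteda version, not something I've verified against the full corpus:

library(quanteda)

# Hypothetical piecewise workflow: split the documents into chunks, build a
# dfm per chunk, then stitch the partial results back together.
n_chunks <- 10                            # arbitrary; small enough to fit in RAM
txts     <- texts(corpus)                 # raw documents as a character vector
chunks   <- split(txts, cut(seq_along(txts), n_chunks, labels = FALSE))

dfm_pieces <- lapply(chunks, function(x) {
    dfm(x, verbose = TRUE, stem = TRUE,
        ignoredFeatures = stopwords("english"),
        removePunct = TRUE, removeNumbers = TRUE)
})

# Assumption: rbind() dispatches to a dfm method that pads non-overlapping
# features with zeros; otherwise the pieces have to be kept separate.
dfm_full <- do.call(rbind, dfm_pieces)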
It's even worse if I try to build n-grams with n > 1:
> dfm(corpus, ngrams = 2, concatenator=" ", verbose = TRUE,
ignoredFeatures = stopwords("english"),
removePunct = TRUE, removeNumbers = TRUE)
Creating a dfm from a corpus ...
... lowercasing
... tokenizing
Error: C stack usage 19925140 is too close to the limit
I found this related post, but it looks like it was an issue with dense matrix coercion that was solved later, and it doesn't help in my case.
Are there better ways to handle this with limited memory, without having to break the corpus data into pieces?
[EDIT] As requested, sessionInfo() data:
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.6 dplyr_0.4.3 quanteda_0.9.4
loaded via a namespace (and not attached):
[1] magrittr_1.5 R6_2.1.2 assertthat_0.1 Matrix_1.2-3 rsconnect_0.4.2 DBI_0.3.1
[7] parallel_3.2.3 tools_3.2.3 Rcpp_0.12.3 stringi_1.0-1 grid_3.2.3 chron_2.3-47
[13] lattice_0.20-33 ca_0.64