r - How do I read and write a TermDocumentMatrix in R?

Question

I made a word cloud from a CSV file in R. I used the TermDocumentMatrix method from the tm package. Here is my code:

csvData <- read.csv("word", encoding = "UTF-8", stringsAsFactors = FALSE)

Encoding(csvData$content) <- "UTF-8"
# useSejongDic() - KoNLP package
nouns <- sapply(csvData$content, extractNoun, USE.NAMES = F)
#create Corpus
myCorpus <- Corpus(VectorSource(nouns))

myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
#remove StopWord 
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

#create Matrix
TDM <- TermDocumentMatrix(myCorpus, control = list(wordLengths=c(2,5)))

m <- as.matrix(TDM)

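This process seems to take too much time. I think extractNoun is the reason it takes so long. To make the code more time-efficient, I want to save the resulting TDM to a file. When I read the saved file back, can I use m <- as.matrix(saved TDM file) as it is? Or is there a better alternative?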
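The save-and-reload step asked about here could be sketched roughly as follows, assuming base R's saveRDS() and readRDS() are acceptable for storage. The two sample documents, the temporary file path, and the variable names are made up for illustration; they stand in for the corpus built from the CSV file.

```r
library(tm)

# Toy documents standing in for the extracted nouns (illustrative only)
docs <- c("apple banana apple", "banana cherry")
corpus <- Corpus(VectorSource(docs))
TDM <- TermDocumentMatrix(corpus)

# Write the TermDocumentMatrix object to disk once ...
path <- tempfile(fileext = ".rds")   # hypothetical location
saveRDS(TDM, path)

# ... and in a later session read it back instead of rebuilding it
TDM_saved <- readRDS(path)
m <- as.matrix(TDM_saved)
```

readRDS() returns the object with its class intact, so as.matrix() should behave the same on the reloaded object as on the original, provided tm is loaded in the session that reads the file.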

score 1 · Accepted Answer

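I'm not an expert, but I do some NLP work from time to time.

I use parSapply from the parallel package. Here is the documentation: http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf

parallel ships with base R. Here is a silly usage example: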

library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)

# define base first, then export it to the workers
base <- 2
clusterExport(cl, "base")

parSapply(cl, as.character(2:4), 
          function(exponent){
            x <- as.numeric(exponent)
            c(base = base^x, self = x^x)
          })

# shut the workers down when finished
stopCluster(cl)

So parallelize nouns <- sapply(csvData$content, extractNoun, USE.NAMES = F), and it should be faster :)
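A minimal sketch of that swap, under a few assumptions: extract_tokens is a hypothetical stand-in for extractNoun so the example runs without KoNLP, the content vector is made-up sample data, and two workers are used arbitrarily.

```r
library(parallel)

# Hypothetical stand-in for KoNLP::extractNoun: splits text on whitespace
extract_tokens <- function(text) strsplit(text, "[[:space:]]+")[[1]]

# Sample data standing in for csvData$content
content <- c("alpha beta", "gamma delta epsilon", "zeta")

cl <- makeCluster(2)   # assumption: two workers
# With the real extractNoun, each worker would likely need the package
# loaded first, e.g. clusterEvalQ(cl, library(KoNLP))
nouns_par <- parSapply(cl, content, extract_tokens, USE.NAMES = FALSE)
stopCluster(cl)

# Sequential version for comparison
nouns_seq <- sapply(content, extract_tokens, USE.NAMES = FALSE)
```

Both calls take the same arguments apart from the cluster object, so the change is mostly mechanical. Whether it pays off depends on how expensive each call is relative to the cost of shipping data to the workers.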

score 0 · Accepted Answer

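I noticed that you call several library(tm) commands that can also be parallelized quite easily. For library tm, this functionality was updated in March 2017, a month after you asked your question.

The New Features section of the release notes for library tm version 0.7 (2017-03-02) states:

tm_parLapply() is now internally used for the parallelization of transformations, filters, and term-document matrix construction. The preferred parallelization engine can be registered via tm_parLapply_engine(). The default is to use no parallelization (instead of mclapply (package parallel) in previous versions).

To set up parallelization for the tm commands, the following worked for me: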

library(parallel)
cores <- detectCores()
cl <- makeCluster(cores)   # use cores-1 if you want to do anything else on the PC.
tm_parLapply_engine(cl)
## insert your commands for create corpus, 
## tm_map and TermDocumentMatrix commands here
tm_parLapply_engine(NULL)
stopCluster(cl)

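If you have a function that you apply through a tm_map content transformer, you need to pass that function to the parallelization environment with clusterExport before the tm_map(MyCorpus, content_transformer(clean)) command. For example, to pass my clean function to the environment: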

clusterExport(cl, "clean")
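Put together, the export and the tm calls might look like the sketch below. The body of clean is hypothetical (the function is not shown here), the two documents are made-up sample data, and the clusterEvalQ() call that loads tm on each worker is an extra precaution rather than something stated above.

```r
library(parallel)
library(tm)

# Hypothetical cleaning function: drop digits and lower-case the text
clean <- function(x) tolower(gsub("[[:digit:]]+", "", x))

cl <- makeCluster(2)             # assumption: two workers
clusterEvalQ(cl, library(tm))    # precaution: make tm available on the workers
clusterExport(cl, "clean")       # export the function before tm_map uses it
tm_parLapply_engine(cl)          # register the cluster with tm

corpus <- VCorpus(VectorSource(c("Apple 123 Banana", "Cherry 42")))
corpus <- tm_map(corpus, content_transformer(clean))
TDM <- TermDocumentMatrix(corpus)

tm_parLapply_engine(NULL)        # back to no parallelization
stopCluster(cl)

Terms(TDM)
```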

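One last comment: keep an eye on your memory usage. If your computer starts paging memory to disk, the CPU is no longer the critical path and none of this parallelization will make a difference.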
