
I have a large dataset (460 MB) with a single column containing 386,551 rows of log entries. I want to build a word cloud using clustering and an N-gram approach. My code is as follows:

library(readr)
AMC <- read_csv("All Tickets.csv")
Desc <- AMC[,4]

# Very large data, hence breaking it down before creating the corpus
# DataframeSource is used instead of VectorSource to be able to handle the data

library(tm)
docs_new <- data.frame(Desc)

test1 <- docs_new[1:100000,]
test2 <- docs_new[100001:200000,]
test3 <- docs_new[200001:300000,]
test4 <- docs_new[300001:386551,]
test1 <- data.frame(test1)
test1 <- Corpus(DataframeSource(test1))
test2 <- data.frame(test2)
test2 <- Corpus(DataframeSource(test2))
test3 <- data.frame(test3)
test3 <- Corpus(DataframeSource(test3))
test4 <- data.frame(test4)
test4 <- Corpus(DataframeSource(test4))

# combine all the corpora
docs_new <- c(test1,test2,test3,test4)

# content_transformer keeps the documents as PlainTextDocument objects,
# so no PlainTextDocument workaround is needed afterwards
docs_new <- tm_map(docs_new, content_transformer(tolower))
docs_new <- tm_map(docs_new, removePunctuation)
docs_new <- tm_map(docs_new, removeNumbers)
docs_new <- tm_map(docs_new, removeWords, stopwords("en"))
docs_new <- tm_map(docs_new, stripWhitespace)
docs_new <- tm_map(docs_new, stemDocument)

# tokenizer for the TDM with n-grams
library(RWeka)
options(mc.cores=1) 
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(docs_new, control = list(tokenize = BigramTokenizer))
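As a quick sanity check before running the tokenizer over the full corpus, it can be tried on a small made-up string (just an illustration; the sample text is not from my data):

# quick check that the tokenizer splits text into 2-grams as expected
sample_text <- "printer not responding after driver update"
BigramTokenizer(sample_text)
# returns: "printer not" "not responding" "responding after" ...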

The TermDocumentMatrix call gives me the following:

<<TermDocumentMatrix (terms: 1874071, documents: 386551)>>
Non-/sparse entries: 17313767/724406705354
Sparsity           : 100%
Maximal term length: 733
Weighting          : term frequency (tf)

Then I convert it to a sparse dgCMatrix using:

library("Matrix")
mat <- sparseMatrix(i=tdm$i, j=tdm$j, x=tdm$v, dims=c(tdm$nrow, tdm$ncol))
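Since my end goal is a word cloud, I believe the bigram frequencies can be computed directly on this sparse matrix without densifying it (a rough sketch, assuming the wordcloud package; the 200-term cutoff is arbitrary):

library(wordcloud)

# total frequency of each bigram across all documents
term_freq <- Matrix::rowSums(mat)
names(term_freq) <- tdm$dimnames$Terms

# keep the most frequent bigrams (arbitrary cutoff) and plot
top_terms <- sort(term_freq, decreasing = TRUE)[1:200]
wordcloud(words = names(top_terms), freq = top_terms, random.order = FALSE)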

I get a memory-size error when trying the following:

removeSparseTerms(tdm, 0.2)
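As far as I understand, removeSparseTerms(tdm, 0.2) keeps only terms whose sparsity is below 0.2, i.e. terms present in at least 80% of the 386,551 documents. If that is right, the same filtering should be possible on the sparse matrix itself without building a dense copy (an untested sketch):

# number of documents each term occurs in
doc_freq <- Matrix::rowSums(mat > 0)

# keep terms present in at least 80% of documents
# (intended to be equivalent to removeSparseTerms(tdm, 0.2))
keep <- doc_freq >= (1 - 0.2) * tdm$ncol
mat_filtered <- mat[keep, , drop = FALSE]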

Please advise on how to proceed, as I am new to text analysis.

