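I am trying to fetch tweets for a keyword, say "zomato", and then run topic modeling on the fetched tweets. Below is the search function that fetches the tweets.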

 library(twitteR)  # provides searchTwitter() and twListToDF()

 search <- function(searchterm)
 {
 # fetch tweets and create the cumulative file
 # paste0() avoids the space that paste() would put into the file name
 stack_file <- paste0(searchterm, '_stack.csv')

 tweets <- searchTwitter(searchterm, n=25000)
 df <- twListToDF(tweets)
 df <- df[, order(names(df))]
 df$created <- strftime(df$created, '%Y-%m-%d')
 if (!file.exists(stack_file)) write.csv(df, file=stack_file, row.names=FALSE)
 # merge the latest fetch with the cumulative file and remove duplicates
 stack <- read.csv(file=stack_file)
 stack <- rbind(stack, df)
 stack <- subset(stack, !duplicated(stack$text))

 return(stack)

 }
 ZomatoResults <- search('Zomato')
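
A side note on the file name used for the cumulative CSV: `paste()` joins its arguments with a space by default, while `paste0()` joins them with nothing in between. A minimal sketch of the difference, using only base R (the variable names `with_space` and `without_space` are just for illustration):

```r
searchterm <- "Zomato"

# paste() uses sep = " " by default, so the name contains a space
with_space <- paste(searchterm, "_stack.csv")

# paste0() concatenates without a separator
without_space <- paste0(searchterm, "_stack.csv")

print(with_space)
print(without_space)
```

Either form works as long as the same expression is used for writing and reading the file, but the version without a space is usually easier to handle outside R.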

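After this I clean the tweets; the cleaned result is stored in the variable "ZomatoCleaned". I have not included that code here. I then build the corpus for topic modeling as shown below.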

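library(tm)            # Corpus, tm_map, DocumentTermMatrix
library(wordcloud)     # wordcloud
library(RColorBrewer)  # brewer.pal
library(topicmodels)   # LDA, CTM

options(mc.cores = 1)  # or whatever
tm_parLapply_engine(parallel::mclapply) 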

corpus <- Corpus(VectorSource(ZomatoCleaned))  # Create corpus object
corpus <- tm_map(corpus, removeWords, stopwords("en"))  
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stemDocument)
pal <- brewer.pal(8, "Dark2")
dev.new(width = 1000, height = 1000, unit = "px")
wordcloud(corpus, min.freq=2, max.words = 100, random.order = TRUE, col = pal)
dat <- DocumentTermMatrix(corpus)
dput(head(dat))
doc.lengths <- rowSums(as.matrix(DocumentTermMatrix(corpus)))
dtm <- DocumentTermMatrix(corpus[doc.lengths > 0])
# model <- LDA(dtm, 10)  # Go ahead and test a simple model if you want

SEED = sample(1:1000000, 1)  # Pick a random seed for replication
k = 10  # Let's start with 10 topics

models <- list(
  CTM       = CTM(dtm, k = k, control = list(seed = SEED, var = list(tol = 10^-4), em = list(tol = 10^-3))),
  VEM       = LDA(dtm, k = k, control = list(seed = SEED)),
  VEM_Fixed = LDA(dtm, k = k, control = list(estimate.alpha = FALSE, seed = SEED)),
  Gibbs     = LDA(dtm, k = k, method = "Gibbs", control = list(seed = SEED, burnin = 1000,
                                                               thin = 100,    iter = 1000))
)

lapply(models, terms, 10)
assignments <- sapply(models, topics) 
head(assignments, n=10)

Unfortunately, at this line:

doc.lengths <- rowSums(as.matrix(DocumentTermMatrix(corpus)))

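I get the error "vector size specified is too large" or "cannot allocate vector of size 36.6 Gb". I am on a system with 8 GB of RAM and RStudio 3.5.2. I have already run gc() and tried setting memory.limit(), but neither helped. Is there a workaround for handling this dataset? I understand this is a memory problem, but I would appreciate help on how to get past it.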
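
For reference, one direction that might avoid the allocation, sketched under assumptions: `as.matrix()` turns the sparse document-term matrix into a dense one (documents x terms x 8 bytes), whereas `slam::row_sums()` computes the row totals directly on the sparse representation. The sketch assumes the `slam` package is available (it is a dependency of `tm`), and the four toy documents are a hypothetical stand-in for `ZomatoCleaned`.

```r
library(tm)
library(slam)

# Hypothetical stand-in for ZomatoCleaned; the second document is empty
docs <- c("zomato delivers food fast",
          "",
          "food food order",
          "great zomato offer")

corpus <- VCorpus(VectorSource(docs))
dtm_all <- DocumentTermMatrix(corpus)

# Rough size of the dense matrix that as.matrix() would have to allocate
dense_gb <- as.numeric(nrow(dtm_all)) * ncol(dtm_all) * 8 / 2^30

# Row totals computed on the sparse matrix, without densifying it
doc_lengths <- slam::row_sums(dtm_all)

# Drop empty documents by subsetting the matrix itself
dtm <- dtm_all[doc_lengths > 0, ]
```

Subsetting the existing matrix also avoids building the document-term matrix a second time. If the vocabulary is still very large, `removeSparseTerms()` from `tm` could be tried before fitting the models, though whether that is appropriate depends on the data.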

[Image: topic modeling error for Zomato]

Output for dat: structure(c(0, 1, 0, 0, 0, 0), weighting = c("term frequency", "tf"), class = c("DocumentTermMatrix", "simple_triplet_matrix"))

[Image: data output]
