I am trying to fetch tweets for a keyword, say "zomato", and then do topic modelling on the fetched tweets. Below is the search function that fetches the tweets:
library(twitteR)

search <- function(searchterm)
{
  # Access tweets and build up a cumulative file
  tweets <- searchTwitter(searchterm, n = 25000)
  df <- twListToDF(tweets)
  df <- df[, order(names(df))]
  df$created <- strftime(df$created, '%Y-%m-%d')
  # Create the cumulative file on first run
  if (!file.exists(paste0(searchterm, '_stack.csv')))
    write.csv(df, file = paste0(searchterm, '_stack.csv'), row.names = FALSE)
  # Merge the latest fetch with the cumulative file and drop duplicate tweets
  stack <- read.csv(file = paste0(searchterm, '_stack.csv'))
  stack <- rbind(stack, df)
  stack <- subset(stack, !duplicated(stack$text))
  return(stack)
}
ZomatoResults <- search('Zomato')
After this I clean the tweets; that step completes fine and the result is stored in the variable "ZomatoCleaned". I have not added that code here (a rough sketch of what it might look like follows). I then form the corpus and do the topic modelling as shown below the sketch.
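Since the actual cleaning code is omitted, here is a minimal, hypothetical sketch of what such a step often looks like; the clean_tweets helper and its regexes are assumptions, not the code actually used:

# Hypothetical cleaning sketch -- the real cleaning code is not shown above
clean_tweets <- function(x) {
  x <- gsub('http\\S+', '', x)      # strip URLs
  x <- gsub('@\\w+', '', x)         # strip @mentions
  x <- gsub('#', '', x)             # keep hashtag words, drop the '#'
  x <- gsub('[^\x01-\x7F]', '', x)  # drop emoji and other non-ASCII characters
  tolower(trimws(x))
}
ZomatoCleaned <- clean_tweets(ZomatoResults$text)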
library(tm)            # corpus handling
library(topicmodels)   # LDA / CTM
library(wordcloud)
library(RColorBrewer)

options(mc.cores = 1)                    # single worker for tm's parallel apply
tm_parLapply_engine(parallel::mclapply)
corpus <- Corpus(VectorSource(ZomatoCleaned))  # create corpus object
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stemDocument)
pal <- brewer.pal(8, "Dark2")
dev.new(width = 1000, height = 1000, unit = "px")
wordcloud(corpus, min.freq = 2, max.words = 100, random.order = TRUE, col = pal)
dat <- DocumentTermMatrix(corpus)
dput(head(dat))
doc.lengths <- rowSums(as.matrix(dat))   # <- this is the line that fails
dtm <- DocumentTermMatrix(corpus[doc.lengths > 0])
# model <- LDA(dtm, 10) # go ahead and test a simple single model if you want
SEED <- sample(1:1000000, 1)  # fix a seed so the run can be replicated
k <- 10                       # start with 10 topics
models <- list(
  CTM       = CTM(dtm, k = k, control = list(seed = SEED, var = list(tol = 10^-4), em = list(tol = 10^-3))),
  VEM       = LDA(dtm, k = k, control = list(seed = SEED)),
  VEM_Fixed = LDA(dtm, k = k, control = list(estimate.alpha = FALSE, seed = SEED)),
  Gibbs     = LDA(dtm, k = k, method = "Gibbs",
                  control = list(seed = SEED, burnin = 1000, thin = 100, iter = 1000))
)
lapply(models, terms, 10)
assignments <- sapply(models, topics)
head(assignments, n = 10)
Unfortunately, at the line

doc.lengths <- rowSums(as.matrix(dat))

I get the error "vector size specified is too large" or "cannot allocate vector of size 36.6 Gb". I am on an 8 GB RAM system running R 3.5.2 in RStudio. I have already run gc() and tried raising memory.limit(), but neither helps. I know this is a memory problem, since as.matrix() expands the sparse document-term matrix into a dense one, but is there a workaround for handling this dataset? Please advise on how to tackle the situation; one direction I am considering is keeping the matrix sparse (see the sketch after the dput output below).
Output of dput(head(dat)):

structure(c(0, 1, 0, 0, 0, 0), weighting = c("term frequency", "tf"),
    class = c("DocumentTermMatrix", "simple_triplet_matrix"))
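The sparse-matrix workaround I am considering is sketched below, assuming the slam package (already a dependency of tm) is available: slam::row_sums computes the row sums directly on the simple_triplet_matrix, so the huge dense matrix is never built, and removeSparseTerms (with an arbitrary 0.999 threshold) shrinks the vocabulary first. This is a sketch, not a verified fix for my data:

library(tm)
library(slam)

dat <- DocumentTermMatrix(corpus)

# Drop very rare terms first so the matrix stays manageable;
# 0.999 keeps terms that appear in at least ~0.1% of documents.
dat <- removeSparseTerms(dat, 0.999)

# Row sums on the sparse triplet representation -- no as.matrix(),
# hence no 36.6 Gb dense allocation.
doc.lengths <- slam::row_sums(dat)

# Subset the sparse DTM directly instead of rebuilding it from the corpus.
dtm <- dat[doc.lengths > 0, ]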