r - In the R text2vec package, can the LDA model show the topic distribution of each token in a document?

Question

library(text2vec)
library(magrittr)    # provides the %>% pipe used below
library(parallel)
library(doParallel)

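# register a parallel backend with one worker per core
N <- parallel::detectCores()
cl <- makeCluster(N)
registerDoParallel(cl)
Ky_young <- read.csv("./Ky_young.csv")

# token iterator over the text column, keyed by document ID
IT <- itoken_parallel(Ky_young$TEXTInfo,
                      ids          = Ky_young$ID,
                      tokenizer    = word_tokenizer,
                      progressbar  = FALSE)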

## stopwords
stop_words <- readLines("./stopwrd1.txt", encoding = "UTF-8")

VOCAB <- create_vocabulary(
        IT, stopwords = stop_words,
        ngram = c(1, 1)) %>%
        prune_vocabulary(term_count_min = 5)


# vocabulary sorted by term frequency, most frequent first
VoCAB.order <- VOCAB[order(VOCAB$term_count, decreasing = TRUE), ]

VECTORIZER <- vocab_vectorizer(VOCAB)

DTM <- create_dtm(IT, VECTORIZER, distributed = FALSE)


LDA_MODEL <- 
      LatentDirichletAllocation$new (n_topics         = 200,
                                     #vocabulary       = VOCAB, <= ERROR
                                     doc_topic_prior  = 0.1,  
                                     topic_word_prior = 0.01) 


##topic-document distribution
LDA_FIT <- LDA_MODEL$fit_transform (
        x = DTM, 
        n_iter = 50, 
        convergence_tol = -1, 
        n_check_convergence = 10)

## topic-word distribution (named TOPIC_WORD so it is not confused with the topic_word_prior argument)
TOPIC_WORD <- LDA_MODEL$topic_word_distribution

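I wrote some test LDA code with text2vec, and I can get the word-topic distribution and the document-topic distribution. (And it is crazy fast.)

By the way, I would like to know whether it is possible to get the topic distribution of each token in a document from text2vec's LDA model.

My understanding is that LDA assigns each token in a document to a specific topic, and that this is how each document ends up with a topic distribution.

If I could get the topic assignment of each token, I would like to group the documents by category (for example, by period) and check how the top words of each topic change between groups. Is that possible?

If there is another way to do this, I would be very grateful to hear about it.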

score 1 · Accepted Answer

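Unfortunately, it is not possible to get the topic assigned to each token in a given document. The document-topic counts are calculated/aggregated "on the fly", so the document-token-topic distribution is not stored anywhere.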
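
Since the sampled per-token assignments are not kept, one possible workaround (not something the answer proposes, only a sketch) is to approximate the topic distribution of a term in a document from the two matrices the model does expose: P(topic | term, document) is taken as proportional to the document-topic weight times the topic-word weight. The helper name `token_topic_posterior` and the toy matrices below are hypothetical. With the question's code, `theta` would presumably be `LDA_FIT` and `phi` would be `LDA_MODEL$topic_word_distribution`, assuming both are row-normalised.

```r
# Toy stand-ins (assumptions) so the sketch runs with base R only:
# theta: documents x topics, phi: topics x terms, each row summing to 1.
theta <- matrix(c(0.9, 0.1,
                  0.2, 0.8),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("doc1", "doc2"), c("topic1", "topic2")))
phi <- matrix(c(0.7, 0.2, 0.1,
                0.1, 0.3, 0.6),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("topic1", "topic2"), c("apple", "bank", "cash")))

# Hypothetical helper: approximate P(topic | term, document) as
# theta[doc, topic] * phi[topic, term], normalised over topics.
# This is an approximation, not the assignments sampled during fitting.
token_topic_posterior <- function(theta, phi, doc, terms) {
  # topics x terms; the length-K vector is recycled down each column
  w <- phi[, terms, drop = FALSE] * theta[doc, ]
  # divide each column by its sum, then return terms x topics
  t(sweep(w, 2, colSums(w), "/"))
}

post <- token_topic_posterior(theta, phi, "doc1", c("apple", "cash"))
print(round(post, 3))
```

Each row of `post` is a distribution over topics for one term in the chosen document. Summing these rows over the tokens of all documents in a group (for example, a period) would give a rough per-group topic-word weighting that could be used to compare top words between groups.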
