I have this data processing code:

library(text2vec)

##Using perplexity for hold out set
t1 <- Sys.time()
perplex <- c()
for (i in 3:25){

  set.seed(17)
  lda_model2 <- LDA$new(n_topics = i)
  doc_topic_distr2 <- lda_model2$fit_transform(x = dtm,  progressbar = F)

  set.seed(17)
  sample.dtm2 <- itoken(rawsample$Abstract, 
                       preprocessor = prep_fun, 
                       tokenizer = tok_fun, 
                       ids = rawsample$id,
                       progressbar = F) %>%
    create_dtm(vectorizer,vtype = "dgTMatrix", progressbar = FALSE)

  set.seed(17)
  new_doc_topic_distr2 <- lda_model2$transform(sample.dtm2, n_iter = 1000, 
                                               convergence_tol = 0.001, n_check_convergence = 25, 
                                               progressbar = FALSE)

  perplex[i]  <- text2vec::perplexity(sample.dtm2, topic_word_distribution = 
                                        lda_model2$topic_word_distribution, 
                                      doc_topic_distribution = new_doc_topic_distr2) 

}
print(difftime(Sys.time(), t1, units = 'sec'))

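I know there are many questions like this one, but I have not been able to find an answer that fits my case exactly. The code above computes the perplexity of Latent Dirichlet Allocation models with 3 to 25 topics. I want to pick the most adequate of these values, which means finding the elbow (or knee) of values that can be treated as a plain numeric vector. The result looks like this: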

1   NA
2   NA
3   222.6229
4   210.3442
5   200.1335
6   190.3143
7   180.4195
8   174.2634
9   166.2670
10  159.7535
11  153.7785
12  148.1623
13  144.1554
14  141.8250
15  138.8301
16  134.4956
17  131.0745
18  128.8941
19  125.8468
20  123.8477
21  120.5155
22  118.4426
23  116.4619
24  113.2401
25  114.1233
plot(perplex)

This is what the plot looks like:

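I would say the elbow is at 13 or 16, but I am not entirely sure, and I want an exact number as the result. I saw in this paper that f''(x) / (1+f'(x)^2)^1.5 is the knee formula, so I tried it like this, and it says 18: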

> d1 <- diff(perplex)                # first derivative
> d2 <- diff(d1) / diff(perplex[-1]) # second derivative
> knee <- (d2)/((1+(d1)^2)^1.5)
Warning message:
In (d2)/((1 + (d1)^2)^1.5) :
  longer object length is not a multiple of shorter object length
> which.min(knee)
[1] 18
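
For reference, the warning in the transcript above comes from a length mismatch: `d1` has 24 elements while `d2` has 23, so R recycles the shorter vector and the positions no longer line up. Below is a minimal sketch of one way to keep them aligned, assuming unit spacing between topic counts and central differences at the interior points. The names `topics`, `perp`, `inner`, `curvature`, and `knee_topics` are illustrative and not part of the original code. The sketch also takes the second derivative as a plain second difference instead of dividing by `diff(perplex[-1])`, and looks for the maximum curvature rather than the minimum; both are swapped-in choices, not the original attempt.

```r
# Perplexity values for 3..25 topics, copied from the table above
topics <- 3:25
perp <- c(222.6229, 210.3442, 200.1335, 190.3143, 180.4195, 174.2634,
          166.2670, 159.7535, 153.7785, 148.1623, 144.1554, 141.8250,
          138.8301, 134.4956, 131.0745, 128.8941, 125.8468, 123.8477,
          120.5155, 118.4426, 116.4619, 113.2401, 114.1233)

n <- length(perp)
inner <- 2:(n - 1)  # interior points, where central differences are defined

# First derivative: central difference, spacing of 1 topic (assumption)
d1 <- (perp[inner + 1] - perp[inner - 1]) / 2
# Second derivative: second difference, same positions as d1
d2 <- perp[inner + 1] - 2 * perp[inner] + perp[inner - 1]

# Curvature f''(x) / (1 + f'(x)^2)^1.5, now with equal-length inputs
curvature <- d2 / (1 + d1^2)^1.5

# For a decreasing, flattening curve the knee has the largest positive curvature
knee_topics <- topics[inner][which.max(curvature)]
print(knee_topics)
```

On the values listed above this picks 24 topics, driven almost entirely by the uptick at 25. Curvature is not scale-invariant, so rescaling either axis can move the knee, and the result is best treated as a rough guide.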

I can't quite figure this out. Could someone share how I can get the exact ideal number of topics from the perplexity values?

1 Answer

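Found this in this paper: "LDA models with the optimal coherence score, obtained with the elbow method (the point with the maximum absolute second derivative) (...)", so this code should work: d1 <- diff(perplex); k <- which.max(abs(diff(d1) / diff(perplex[-1])))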
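
One detail worth spelling out is how the index `k` maps back to a number of topics. The sketch below is a hedged variant, not the exact one-liner above: it assumes unit spacing in the number of topics and uses a plain second difference (base R `diff(x, differences = 2)`) without the division by `diff(perplex[-1])`. The name `elbow_topics` is illustrative.

```r
# Perplexity vector as printed in the question: positions 1 and 2 are NA,
# and position i holds the perplexity for i topics
perplex <- c(NA, NA,
             222.6229, 210.3442, 200.1335, 190.3143, 180.4195, 174.2634,
             166.2670, 159.7535, 153.7785, 148.1623, 144.1554, 141.8250,
             138.8301, 134.4956, 131.0745, 128.8941, 125.8468, 123.8477,
             120.5155, 118.4426, 116.4619, 113.2401, 114.1233)

# Second difference; element j equals
# perplex[j + 2] - 2 * perplex[j + 1] + perplex[j], centred on position j + 1
d2 <- diff(perplex, differences = 2)

# which.max() discards NA entries, so the leading NAs do not interfere
k <- which.max(abs(d2))

# Shift by one to turn the index into a number of topics
elbow_topics <- k + 1
print(elbow_topics)
```

With the values from the question this selects 24 topics, again because of the jump at the last point; second differences amplify noise, so smoothing the curve or excluding the endpoints before taking the maximum may give a more stable answer.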

Answered 2019-11-15T22:06:34.297