statistics - Latent Dirichlet allocation with an unknown number of topics

Question

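I am looking for a technique similar to LDA, but for the case where the optimal number of "mixtures" is not known in advance. Is there anything that can do this?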

score 6 · Accepted Answer

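There are two ways to approach this: one is simple but crude; the other is better motivated but more complex. Starting with the former, you can simply try a range of values of k (the number of topics) and compare the likelihood of the observed data under each. Depending on your situation, you may want to penalize larger numbers of topics, or you can explicitly place a prior distribution on k (for example, a normal distribution centered on the number of clusters you subjectively expect). In either case, you then select the k that maximizes the resulting (penalized or prior-weighted) likelihood.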
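The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `select_k` and `log_normal_prior`, the linear per-topic penalty, and the numbers in `example_log_liks` are all hypothetical and made up for the example; in practice the log-likelihoods would come from models fitted at each k.

```python
import math


def log_normal_prior(k, mean, sd):
    """Log density of a normal prior on k, treating k as a real number."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (k - mean) ** 2 / (2 * sd ** 2)


def select_k(log_likelihoods, penalty_per_topic=0.0, prior_mean=None, prior_sd=None):
    """Pick the k with the best log-likelihood after penalty and optional prior.

    log_likelihoods maps each candidate k to the data log-likelihood of the
    model fitted with k topics.
    """
    scores = {}
    for k, log_lik in log_likelihoods.items():
        value = log_lik - penalty_per_topic * k  # linear penalty (an assumption)
        if prior_mean is not None and prior_sd is not None:
            value += log_normal_prior(k, prior_mean, prior_sd)
        scores[k] = value
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Made-up log-likelihoods, for illustration only (not real results).
example_log_liks = {5: -10500.0, 10: -10200.0, 20: -10100.0, 40: -10080.0}

best_plain, _ = select_k(example_log_liks)
best_penalized, _ = select_k(example_log_liks, penalty_per_topic=15.0)
best_with_prior, _ = select_k(example_log_liks, prior_mean=20, prior_sd=2)
```

Without a penalty or prior, the training likelihood in this toy input keeps improving as k grows, so the largest k wins; the penalty and the prior are what pull the choice back toward a smaller model.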

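The more principled approach is to use Bayesian nonparametrics, specifically Dirichlet processes in the case of topic models. Take a look at this paper. I do believe there is an implementation available here, though I haven't looked into it much.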

score 2 · Accepted Answer

As Byron said, the simplest way to do this is to compare likelihoods for different values of k. However, if you take care to consider the probability of some held-out data (i.e. not used to induce the model), this naturally penalises overfitting and so you don't need to normalise for k. A simple way to do this is to take your training data and split it into a training set and a dev set, and do a search over a range of plausible k values, inducing models from the training set and then computing dev set probability given the induced model.
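A minimal sketch of this held-out search, assuming scikit-learn is available (the answer itself recommends mallet, so this is a swapped-in library, not the author's setup). `LatentDirichletAllocation.score` returns an approximate log-likelihood based on a variational bound, not the exact value. The helper names `make_toy_counts` and `pick_k_by_heldout` are hypothetical, and the random count matrix is placeholder data standing in for a real document-term matrix.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation


def make_toy_counts(n_docs=80, n_words=40, seed=0):
    """Placeholder document-term counts; replace with a real corpus."""
    rng = np.random.default_rng(seed)
    return rng.poisson(1.0, size=(n_docs, n_words))


def pick_k_by_heldout(X_train, X_dev, candidate_ks, seed=0):
    """Fit LDA on X_train for each k and score it on the held-out X_dev."""
    scores = {}
    for k in candidate_ks:
        lda = LatentDirichletAllocation(n_components=k, random_state=seed)
        lda.fit(X_train)
        # Approximate held-out log-likelihood (variational bound); higher is better.
        scores[k] = lda.score(X_dev)
    best_k = max(scores, key=scores.get)
    return best_k, scores


X = make_toy_counts()
X_train, X_dev = X[:60], X[60:]  # simple train/dev split of the documents
best_k, dev_scores = pick_k_by_heldout(X_train, X_dev, [2, 4, 8])
```

Because the dev documents play no part in fitting, a model with too many topics tends to score worse on them, which is why no explicit penalty on k appears here.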

It's worth mentioning that computing the likelihood exactly under LDA is intractable, so you're going to need to use approximate inference. This paper goes into this in depth, but if you use a standard LDA package (I'd recommend mallet: http://mallet.cs.umass.edu/), it should have this functionality already.

The non-parametric version is indeed the correct way to go, but inference in non-parametric models is computationally expensive, so I would hesitate to pursue this unless the above doesn't work.
