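I am using topic modelling to cluster documents, and I need to come up with the optimal number of topics. So I decided to do ten-fold cross-validation with 10, 20, ... 60 topics.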

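I have divided my corpus into ten batches and set aside one batch as a holdout set. I have run latent Dirichlet allocation (LDA) on the other nine batches (180 documents in total) with 10 to 60 topics. Now I have to calculate the perplexity or log-likelihood for the holdout set.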

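I found this code in one of the discussions on CV. There are several lines of the code below that I really don't understand. I have a dtm matrix for the holdout set (20 documents), but I don't know how to calculate the perplexity or log-likelihood of this holdout set.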


  1. Can anybody explain to me what seq(2, 100, by = 1) means here? Also, what does AssociatedPress[21:30] mean? What is function(k) doing here?

    best.model <- lapply(seq(2, 100, by=1), function(k){ LDA(AssociatedPress[21:30,], k) })
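  2. If I want to calculate the perplexity or log-likelihood of the holdout set called dtm, is there better code? I know there are perplexity() and logLik() functions, but since I am new to this I cannot figure out how to use them with my holdout matrix (called dtm).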

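  3. How can I do ten-fold cross-validation on my corpus of 200 documents? Is there existing code that I can call? I found caret for this purpose, but I cannot figure that out either.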


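Perplexity is a measure of how well a probability model fits a new set of data. In the topicmodels R package it is simple to compute with the perplexity function, which takes as arguments a previously fitted topic model and a new set of data, and returns a single number. The lower the better.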

For example, splitting the AssociatedPress data into a training set (75% of the rows) and a validation set (25% of the rows):

# load up some R packages including a few we'll need later
library(topicmodels)
library(doParallel)   # also attaches foreach and parallel
library(ggplot2)

data("AssociatedPress", package = "topicmodels")

burnin = 1000
iter = 1000
keep = 50

full_data  <- AssociatedPress
n <- nrow(full_data)
k <- 5

splitter <- sample(1:n, round(n * 0.75))
train_set <- full_data[splitter, ]
valid_set <- full_data[-splitter, ]

fitted <- LDA(train_set, k = k, method = "Gibbs",
                          control = list(burnin = burnin, iter = iter, keep = keep) )
perplexity(fitted, newdata = train_set) # about 2700
perplexity(fitted, newdata = valid_set) # about 4300
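For intuition about the number that perplexity() returns: perplexity is conventionally defined as the exponential of the negative average log-likelihood per token. The sketch below computes it by hand in base R for a toy unigram model. The word probabilities, the counts, and the helper name toy_perplexity are made up for illustration and are not part of topicmodels.

```r
# Hypothetical helper: perplexity of held-out word counts under a
# fixed word-probability vector (a unigram model, much simpler than LDA).
toy_perplexity <- function(word_probs, counts) {
  stopifnot(length(word_probs) == length(counts))
  log_lik <- sum(counts * log(word_probs))  # total log-likelihood of the tokens
  exp(-log_lik / sum(counts))               # exp(-average log-likelihood per token)
}

# Made-up model over a four-word vocabulary
probs <- c(a = 0.25, b = 0.25, c = 0.25, d = 0.25)

# A uniform model over V words has perplexity V, whatever the counts
uniform_pp <- toy_perplexity(probs, counts = c(3, 1, 4, 2))

# A model that puts more mass on the words that actually occur scores lower
skewed <- c(a = 0.4, b = 0.1, c = 0.4, d = 0.1)
skewed_pp <- toy_perplexity(skewed, counts = c(5, 0, 5, 0))
```

Read this way, a perplexity of about 4300 on the validation set says the fitted model is roughly as uncertain about each held-out word as a uniform choice among 4300 words would be.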



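Extending this idea to cross-validation is straightforward. Divide the data into different subsets (say 5), and each subset gets one turn as the validation set and four turns as part of the training set. However, it is really computationally intensive, particularly when trying out larger numbers of topics.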
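One way to set up the folds, sketched in base R. The corpus size (200 documents) and the ten folds are taken from the question; the stand-in matrix docs and all variable names are assumptions. Unlike sample(1:folds, n, replace = TRUE), shuffling a repeated 1..folds vector gives folds of equal size.

```r
set.seed(42)                     # arbitrary seed so the split is reproducible
n_docs  <- 200                   # corpus size from the question
n_folds <- 10

# Stand-in for a document-term matrix: one row per document (made-up counts)
docs <- matrix(rpois(n_docs * 5, lambda = 2), nrow = n_docs)

# Repeat 1..n_folds until there is one label per document, then shuffle
fold_id <- sample(rep(seq_len(n_folds), length.out = n_docs))

# For each fold, the held-out rows and the rows left for training
splits <- lapply(seq_len(n_folds), function(i) {
  list(train = docs[fold_id != i, , drop = FALSE],
       valid = docs[fold_id == i, , drop = FALSE])
})

table(fold_id)                   # fold sizes
```

Each element of splits then holds 180 training rows and 20 validation rows, which matches the batch sizes described in the question. Row indexing by fold label should carry over to a DocumentTermMatrix in the same way.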


The code below, even with parallel processing on 7 logical CPUs, took 3.5 hours to run on my laptop:

#----------------5-fold cross-validation, different numbers of topics----------------
# set up a cluster for parallel processing
cluster <- makeCluster(detectCores(logical = TRUE) - 1) # leave one CPU spare...

# register the cluster with foreach, then load the needed R package on all the parallel sessions
registerDoParallel(cluster)
clusterEvalQ(cluster, {
   library(topicmodels)
})

folds <- 5
splitfolds <- sample(1:folds, n, replace = TRUE)
candidate_k <- c(2, 3, 4, 5, 10, 20, 30, 40, 50, 75, 100, 200, 300) # candidates for how many topics

# export all the needed R objects to the parallel sessions
clusterExport(cluster, c("full_data", "burnin", "iter", "keep", "splitfolds", "folds", "candidate_k"))

# we parallelize by the different number of topics.  A processor is allocated a value
# of k, and does the cross-validation serially.  This is because it is assumed there
# are more candidate values of k than there are cross-validation folds, hence it
# will be more efficient to parallelise over k rather than over folds
results <- foreach(j = 1:length(candidate_k), .combine = rbind) %dopar%{
   k <- candidate_k[j]
   results_1k <- matrix(0, nrow = folds, ncol = 2)
   colnames(results_1k) <- c("k", "perplexity")
   for(i in 1:folds){
      train_set <- full_data[splitfolds != i , ]
      valid_set <- full_data[splitfolds == i, ]

      fitted <- LDA(train_set, k = k, method = "Gibbs",
                    control = list(burnin = burnin, iter = iter, keep = keep) )
      results_1k[i,] <- c(k, perplexity(fitted, newdata = valid_set))
   }
   return(results_1k)
}
stopCluster(cluster)

results_df <- as.data.frame(results)

ggplot(results_df, aes(x = k, y = perplexity)) +
   geom_point() +
   geom_smooth(se = FALSE) +
   ggtitle("5-fold cross-validation of topic modelling with the 'Associated Press' dataset",
           "(ie five different models fit for each candidate number of topics)") +
   labs(x = "Candidate number of topics", y = "Perplexity when fitting the trained model to the hold-out set")

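We see in the results that 200 topics is too many and shows some over-fitting, while 50 is too few. Of the numbers of topics tried, 100 is the best, with the lowest average perplexity across the five different hold-out sets.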
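Picking the winner from a results table can be sketched as follows. The numbers here are invented purely to make the example runnable and are not the results of the run above; only the column names k and perplexity match results_df.

```r
# Invented stand-in for results_df: 3 candidate k values x 2 folds
toy_results <- data.frame(
  k          = c(50, 50, 100, 100, 200, 200),
  perplexity = c(2500, 2520, 2400, 2410, 2450, 2470)
)

# Mean hold-out perplexity for each candidate number of topics
mean_pp <- aggregate(perplexity ~ k, data = toy_results, FUN = mean)

# The candidate with the lowest mean perplexity
best_k <- mean_pp$k[which.min(mean_pp$perplexity)]
```

Averaging over folds before comparing keeps a single lucky or unlucky split from deciding the choice of k.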


answered 2017-01-04T21:47:42.410


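  1. seq(2, 100, by = 1) simply creates a sequence of numbers from 2 to 100 in steps of one, so 2, 3, 4, 5, ... 100. Those are the numbers of topics that I want to use in the models: one model with 2 topics, another with 3 topics, another with 4 topics, and so on up to 100 topics.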

  2. AssociatedPress[21:30] is simply a subset of the data built into the topicmodels package. I only used a subset in that example so that it would run faster.
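The remaining part of that line, lapply(..., function(k) ...), calls an anonymous function once for each value in the sequence, with k bound to that value, and collects the results in a list. A small base R sketch, with a made-up toy function body standing in for the LDA() call:

```r
# The sequence of candidate topic counts: 2, 3, ..., 100
ks <- seq(2, 100, by = 1)

# lapply() calls the anonymous function once per element of ks.
# Here the body just returns a small list; in the original line it
# fits LDA(AssociatedPress[21:30, ], k) instead.
toy_models <- lapply(ks, function(k) {
  list(n_topics = k)
})

length(toy_models)          # one result per candidate k
toy_models[[1]]$n_topics    # result for the first value of the sequence
```

So best.model in the question is a list of fitted models, one for each number of topics from 2 to 100.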

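On the general question of the optimal number of topics, I now follow Martin Ponweiser's example of model selection by harmonic mean (section 4.3.3 in his thesis, which is here: http://epub.wu.ac.at/3558/1/main.pdf). Here is how I do it at the moment: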

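library(topicmodels)
library(Rmpfr)   # provides mpfr() for arbitrary-precision arithmetic

# get some of the example data that's bundled with the package
data("AssociatedPress", package = "topicmodels")

# log of the harmonic mean of the likelihoods; mpfr() supplies the extra precision
harmonicMean <- function(logLikelihoods, precision=2000L) {
  llMed <- median(logLikelihoods)
  as.double(llMed - log(mean(exp(-mpfr(logLikelihoods,
                                       prec = precision) + llMed))))
}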

# The log-likelihood values are then determined by first fitting the model using for example
k = 20
burnin = 1000
iter = 1000
keep = 50

fitted <- LDA(AssociatedPress[21:30,], k = k, method = "Gibbs",control = list(burnin = burnin, iter = iter, keep = keep) )

# where keep indicates that every keep iteration the log-likelihood is evaluated and stored. This returns all log-likelihood values including burnin, i.e., these need to be omitted before calculating the harmonic mean:

logLiks <- fitted@logLiks[-c(1:(burnin/keep))]

# assuming that burnin is a multiple of keep, and then
harmonicMean(logLiks)



# generate numerous topic models with different numbers of topics
sequ <- seq(2, 50, 1) # in this case a sequence of numbers from 2 to 50, by ones.
fitted_many <- lapply(sequ, function(k) LDA(AssociatedPress[21:30,], k = k, method = "Gibbs",control = list(burnin = burnin, iter = iter, keep = keep) ))

# extract logliks from each topic
logLiks_many <- lapply(fitted_many, function(L)  L@logLiks[-c(1:(burnin/keep))])

# compute harmonic means
hm_many <- sapply(logLiks_many, function(h) harmonicMean(h))

# inspect
plot(sequ, hm_many, type = "l")

# compute optimum number of topics
sequ[which.max(hm_many)]
## 6

Here is the output, with the number of topics along the x-axis, indicating that 6 topics is optimum.
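The arbitrary-precision arithmetic in harmonicMean is there because exp() of a large negated log-likelihood overflows in double precision. As an alternative sketch (not Ponweiser's method as written, and not part of topicmodels), the same quantity, -log(mean(exp(-logLik))), can be computed in base R with the log-sum-exp trick. The function name and the input values are made up for illustration.

```r
# Hypothetical base-R alternative: log of the harmonic mean of likelihoods,
# computed from log-likelihoods without leaving double precision.
harmonic_mean_lse <- function(log_liks) {
  neg <- -log_liks
  m <- max(neg)                      # shift so the largest exponent is 0
  -(m + log(mean(exp(neg - m))))     # equals -log(mean(exp(-log_liks)))
}

# Small made-up values: the naive formula still works and should agree
small  <- c(-1, -2, -3)
naive  <- -log(mean(exp(-small)))
stable <- harmonic_mean_lse(small)

# Made-up values of larger magnitude: exp(100000) overflows to Inf,
# but the shifted version stays finite
big <- harmonic_mean_lse(c(-100000, -100010))
```

Whether this is accurate enough for a particular set of log-likelihoods is worth checking against the Rmpfr version before relying on it.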

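Cross-validation of topic models is pretty well documented in the docs that come with the package, see here for example: http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf. Give that a try and then come back with a more specific question about coding cross-validation with topic models.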

answered 2014-01-27T23:42:24.283