python - LDA model generates different topics every time I train on the same corpus

Question

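I am using Python gensim to train a Latent Dirichlet Allocation (LDA) model on a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics.

Why do the same LDA parameters and corpus generate different topics every time?

And how do I stabilize the topic generation?

I'm using this corpus (http://pastebin.com/WptkKVF0) and this list of stopwords (http://pastebin.com/LL7dqLcj), and here is my code: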

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip
from collections import defaultdict
import codecs, os, glob, math

stopwords = [i.strip() for i in codecs.open('stopmild','r','utf8').readlines() if i[0] != "#" and i != ""]

def generateTopics(corpus, dictionary):
    # Build LDA model using the above corpus
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
    corpus_lda = lda[corpus]

    # Group topics with similar words together.
    tops = set(lda.show_topics(50))
    top_clusters = []
    for l in tops:
        top = []
        for t in l.split(" + "):
            top.append((t.split("*")[0], t.split("*")[1]))
        top_clusters.append(top)

    # Generate word only topics
    top_wordonly = []
    for i in top_clusters:
        top_wordonly.append(":".join([j[1] for j in i]))

    return lda, corpus_lda, top_clusters, top_wordonly

####################################################################### 

# Read textfile, build dictionary and bag-of-words corpus
documents = []
for line in codecs.open("./europarl-mini2/map/coach.en-es.all","r","utf8"):
    lemma = line.split("\t")[3]
    documents.append(lemma)
texts = [[word for word in document.lower().split() if word not in stopwords]
             for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda, corpus_lda, topic_clusters, topic_wordonly = generateTopics(corpus, dictionary)

for i in topic_wordonly:
    print i

score 32 · Accepted Answer

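Why do the same LDA parameters and corpus generate different topics every time?

Because LDA uses randomness in both the training and inference steps.

And how do I stabilize the topic generation?

By resetting the numpy.random seed to the same value every time a model is trained or inference is performed, using numpy.random.seed: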

import numpy as np

SOME_FIXED_SEED = 42

# before training/inference:
np.random.seed(SOME_FIXED_SEED)

(This is ugly, and it makes Gensim results hard to reproduce; consider submitting a patch. I've already opened an issue.)

score 8 · Accepted Answer

Set the random_state parameter when initializing LdaModel().

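import gensim

# corpus, id2word, num_topics and num_passes are assumed to be defined already
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=num_topics,
                                            random_state=1,  # fixed seed for reproducible topics
                                            passes=num_passes,
                                            alpha='auto')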

score 2 · Accepted Answer

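I had the same problem, even with about 50,000 comments. But you can get much more consistent topics by increasing the number of iterations the LDA runs for. It is initially set to 50, and when I raised it to 300 it usually gave me the same results, probably because it gets much closer to convergence.

Specifically, you just add the following option: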

ldamodel.LdaModel(corpus, ..., iterations=<your desired iterations>)
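
To check whether a higher iterations value really makes runs agree, it helps to put a number on "the same results". The following is a minimal sketch in Python 3, not part of gensim: the helper names jaccard and run_agreement are hypothetical, and the two runs are made-up word lists. With gensim, the top words of a topic could be taken from lda.show_topic(topic_id), which returns (word, probability) pairs.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of words."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def run_agreement(topics_a, topics_b):
    """Average, over the topics of run A, of the best Jaccard match in run B.

    Each argument is a list of topics; each topic is a list of its top words.
    Topic order is ignored because LDA topic indices are arbitrary.
    """
    if not topics_a or not topics_b:
        return 0.0
    best = [max(jaccard(t, u) for u in topics_b) for t in topics_a]
    return sum(best) / len(best)


# Hypothetical top words from two training runs (made-up data).
run1 = [["coach", "team", "player"], ["bus", "travel", "road"]]
run2 = [["road", "bus", "travel"], ["coach", "team", "train"]]

print(run_agreement(run1, run1))  # identical runs
print(run_agreement(run1, run2))  # topics reordered, one word changed
```

A value near 1 means the two runs found essentially the same topics. Note that the measure is not symmetric, so comparing in both directions gives a fuller picture.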

score 1 · Accepted Answer

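This is due to the probabilistic nature of LDA, as others have pointed out. However, I don't think setting the random_state parameter to a fixed number is the right solution.

Definitely first try increasing the number of iterations to make sure your algorithm converges. Even then, each starting point may land you in a different local minimum. So you can run LDA several times without setting random_state, and then compare the results using each model's coherence score. This helps you avoid suboptimal local minima.

Gensim's CoherenceModel already implements the most common coherence metrics for you, such as c_v, u_mass and c_npmi.

You may find that these steps make the results more stable, but they don't actually guarantee the same results from run to run. However, IMO it is better to get as close to the global optimum as possible than to be stuck in the same local minimum because of a fixed random_state.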
