python - Gensim topic modeling with Mallet perplexity

Question

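I am doing topic modeling on Harvard Library book titles and subjects.

I use the Gensim Mallet wrapper to model with Mallet's LDA. When I try to get the Coherence and Perplexity values to see how good the model is, perplexity fails to compute and I get the warnings shown below. I do not get the same error if I use Gensim's built-in LDA model instead of Mallet. My corpus holds 7M+ documents of at most 50 words each, averaging 20 words, so the documents are short.

Here are the relevant parts of my code: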

# TOPIC MODELING

import gensim
from gensim.models import CoherenceModel
num_topics = 50

# Build Gensim's LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics,
                                       random_state=100,
                                       update_every=1,
                                       chunksize=100,
                                       passes=10,
                                       alpha='auto',
                                       per_word_topics=True)

# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  
# a measure of how good the model is. lower the better.

Perplexity:  -47.91929228302663

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, 
texts=data_words_trigrams, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

Coherence Score:  0.28852857563541856

Gensim's LDA gives both scores without any problem. Now I model the same bag of words with MALLET:

# Building LDA Mallet Model
mallet_path = '~/mallet-2.0.8/bin/mallet' # update this path
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, 
corpus=corpus, num_topics=num_topics, id2word=id2word)

# Convert mallet to gensim type
mallet_model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=mallet_model, 
texts=data_words_trigrams, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)

Coherence Score:  0.5994123896865993

Then I ask for the Perplexity value and get the warnings below and a NaN value.

# Compute Perplexity
print('\nPerplexity: ', mallet_model.log_perplexity(corpus))

/app/app-py3/lib/python3.5/site-packages/gensim/models/ldamodel.py:1108: RuntimeWarning: invalid value encountered in multiply
  score += np.sum((self.eta - _lambda) * Elogbeta)

Perplexity:  nan

/app/app-py3/lib/python3.5/site-packages/gensim/models/ldamodel.py:1109: RuntimeWarning: invalid value encountered in subtract
  score += np.sum(gammaln(_lambda) - gammaln(self.eta))

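I realize this is a very Gensim-specific question and requires a deeper understanding of this function: gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)

So I would appreciate any comments on the warnings and on the Gensim side of things.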

score 1 · Accepted Answer

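I don't think the perplexity function is implemented for the Mallet wrapper. As mentioned in Radim's answer, the perplexity is displayed on stdout:

AFAIR, Mallet displays the perplexity to stdout. Would that be enough for you? Capturing these values programmatically should be possible too, but I haven't looked into that. Hopefully Mallet has some API call for perplexity eval too, but it's certainly not included in the wrapper.

I just ran it on a sample corpus, and the LL/token was indeed printed every so many iterations:

LL/token: -9.45493

perplexity = 2^(-LL/token) = 701.81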
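The conversion above can be sketched in a few lines of Python. This is only an illustration: the helper names `ll_per_token_values` and `perplexity_from_ll` are made up here, and the regular expression assumes Mallet's progress lines contain the text `LL/token: <number>`, as in the line quoted above. Base 2 follows the formula in this answer; if your Mallet build reports natural-log likelihoods, pass `base=math.e` instead (worth checking before trusting the number).

```python
import math
import re

# Assumed line shape: "... LL/token: -9.45493" (adjust if your output differs).
LL_TOKEN_RE = re.compile(r"LL/token:\s*(-?\d+(?:\.\d+)?)")


def ll_per_token_values(log_text):
    """Return every LL/token value found in captured Mallet output, in order."""
    return [float(value) for value in LL_TOKEN_RE.findall(log_text)]


def perplexity_from_ll(ll_per_token, base=2.0):
    """Convert a per-token log-likelihood into perplexity: base ** (-LL/token)."""
    return base ** (-ll_per_token)


if __name__ == "__main__":
    # The single value quoted in this answer; any other lines are ignored.
    captured = "LL/token: -9.45493\n"
    for ll in ll_per_token_values(captured):
        print("LL/token %.5f -> perplexity %.2f" % (ll, perplexity_from_ll(ll)))
```

To use it, redirect the output of the Mallet training run to a file (or capture it from the subprocess) and pass the text to `ll_per_token_values`; the last value in the list is the most recent estimate.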

score 1 · Accepted Answer

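A few cents from me.

It looks like in lda_model.log_perplexity(corpus) you use the same corpus you trained on. You might have better luck with a held-out/test set of the corpus.
lda_model.log_perplexity(corpus) doesn't return perplexity. It returns the "bound". If you want to turn it into perplexity, do np.exp2(-bound). I struggled with this for a while :)
There is no way to report perplexity with the Mallet wrapper, afaik.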
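The first two points can be sketched roughly as follows. The helpers `split_corpus`, `perplexity_from_bound` and `heldout_perplexity` are hypothetical names, not part of Gensim; the only Gensim call assumed is `log_perplexity`, which the question already uses. Plain `2.0 ** (-bound)` is used so the snippet has no dependencies; it is the same computation as `np.exp2(-bound)`.

```python
import random


def split_corpus(corpus, test_fraction=0.1, seed=100):
    """Shuffle the documents and split them into (train, heldout) lists.

    Note: this materializes the corpus in memory, which may not be
    practical for a corpus of millions of documents.
    """
    docs = list(corpus)
    indices = list(range(len(docs)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(docs) * test_fraction))
    heldout = [docs[i] for i in indices[:n_test]]
    train = [docs[i] for i in indices[n_test:]]
    return train, heldout


def perplexity_from_bound(bound):
    """Turn the per-word bound into perplexity: 2 ** (-bound)."""
    return 2.0 ** (-bound)


def heldout_perplexity(model, heldout_corpus):
    """Evaluate a trained model on documents it did not see during training.

    `model` is any object exposing log_perplexity(corpus), such as the
    lda_model built in the question.
    """
    return perplexity_from_bound(model.log_perplexity(heldout_corpus))
```

With the question's variables, one would train on `train` instead of `corpus` and then call `heldout_perplexity(lda_model, heldout)`. This applies to Gensim's own LdaModel; it does not make perplexity available for the Mallet wrapper.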
