I am trying to compute perplexity scores in Spyder for different numbers of topics, in order to find the optimal model parameters using gensim.
However, the perplexity score does not decrease as expected [1]. Moreover, quite a few other people seem to have run into this exact problem, but as far as I know no solution has been posted.
Does anyone know how to solve this?
Code:

```python
import gensim
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# corpus, dictionary, and output_directory are defined earlier:
# corpus is a gensim bag-of-words corpus, dictionary the matching gensim Dictionary.
X_train, X_test = train_test_split(corpus, train_size=0.9, test_size=0.1, random_state=1)

topic_range = [10, 20, 25, 30, 40, 50, 60, 70, 75, 90, 100, 150, 200]

def lda_function(X_train, X_test, dictionary, nr_topics):
    ldamodel2 = gensim.models.LdaModel(X_train,
                                       id2word=dictionary,
                                       num_topics=nr_topics,
                                       alpha='auto',
                                       eta=0.01,
                                       passes=10,
                                       iterations=500,
                                       random_state=42)
    # log_perplexity returns the per-word likelihood bound in log base 2,
    # so 2**(-bound) converts it to a perplexity score (lower is better).
    return 2**(-1*ldamodel2.log_perplexity(X_test))

log_perplecs = [lda_function(X_train, X_test, dictionary, nr_topics=topic) for topic in topic_range]
print("\n", log_perplecs)

fig1, ax1 = plt.subplots(figsize=(7, 5))
ax1.scatter(x=topic_range, y=log_perplecs)
fig1.tight_layout()
fig1.savefig(output_directory + "Optimal Number of Topics (Perplexity Score).pdf", bbox_inches='tight')
```
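To make the problem easier to reproduce without my data, here is a minimal self-contained version of the same evaluation loop on a toy corpus (the documents and the small topic range below are made up purely for illustration):

```python
import gensim
from gensim.corpora import Dictionary

# Toy documents, invented for illustration only.
texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system", "response", "time"],
         ["eps", "user", "interface", "system"],
         ["system", "human", "system", "eps"],
         ["user", "response", "time"],
         ["trees"],
         ["graph", "trees"],
         ["graph", "minors", "trees"],
         ["graph", "minors", "survey"]]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

for k in (2, 3, 5):
    lda = gensim.models.LdaModel(corpus, id2word=dictionary, num_topics=k,
                                 passes=5, random_state=42)
    # log_perplexity gives the per-word bound in log base 2;
    # 2**(-bound) turns it into a perplexity (lower = better fit).
    print(k, 2 ** (-lda.log_perplexity(corpus)))
```

Even in this stripped-down setting I see the same pattern as in the plot linked above, so the issue does not appear to be specific to my preprocessing.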
[1]: https://i.stack.imgur.com/jFiF1.png