
I am trying to compute perplexity scores in Spyder for different numbers of topics, in order to find the optimal model parameters using gensim.

However, the perplexity score does not decrease as expected [1]. It also seems that quite a few people have run into this exact problem, but as far as I can tell no solution has been posted.

Does anyone know how to fix this?

The code:

```python
import gensim
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Hold out 10% of the corpus for evaluating perplexity
X_train, X_test = train_test_split(corpus, train_size=0.9, test_size=0.1, random_state=1)

topic_range = [10, 20, 25, 30, 40, 50, 60, 70, 75, 90, 100, 150, 200]

def lda_function(X_train, X_test, dictionary, nr_topics):
    ldamodel2 = gensim.models.LdaModel(X_train,
                                       id2word=dictionary,
                                       num_topics=nr_topics,
                                       alpha='auto',
                                       eta=0.01,
                                       passes=10,
                                       iterations=500,
                                       random_state=42)
    # log_perplexity returns a per-word likelihood bound; convert it to perplexity
    return 2**(-1*ldamodel2.log_perplexity(X_test))

log_perplecs = [lda_function(X_train, X_test, dictionary, nr_topics=topic) for topic in topic_range]

print("\n", log_perplecs)

fig1, ax1 = plt.subplots(figsize=(7,5))
ax1.scatter(x=topic_range, y=log_perplecs)
fig1.tight_layout()

fig1.savefig(output_directory + "Optimal Number of Topics (Perplexity Score).pdf", bbox_inches='tight')
```
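For context on the conversion in the return statement: as far as I understand, gensim's `log_perplexity` returns a per-word likelihood bound rather than a perplexity, and gensim itself reports perplexity as `2^(-bound)` in its log output. A minimal sketch of that conversion (the helper name `bound_to_perplexity` is my own, not a gensim API):

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound (as returned by
    LdaModel.log_perplexity) into a perplexity score.
    Lower perplexity means the held-out data is less surprising."""
    return 2 ** (-per_word_bound)

# A more negative bound maps to a higher (worse) perplexity,
# so if the bound worsens as num_topics grows, perplexity rises.
assert bound_to_perplexity(-7.0) > bound_to_perplexity(-6.0)
```

So a *rising* curve in the plot means the held-out bound is getting more negative as the topic count grows, not that the conversion itself is wrong.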




  [1]: https://i.stack.imgur.com/jFiF1.png
