I'm currently doing topic modelling with LDA from the text2vec package. I managed to create a document-term matrix (DTM) and then fit LDA to it using its fit_transform method with n_topics=50.

While looking at the top words for each topic, a question came to mind. I plan to apply the model to new data afterwards, and that data may contain new words that the model has not encountered before. Will the model still be able to assign each of these words to a topic? Moreover, will these words also be added to the topic, so that I can find them using get_top_words?

Thank you for answering!

1 Answer

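The idea behind statistical learning is that the underlying distributions of the "train" data and the "test" data are more or less the same. So if your new documents come from a completely different distribution, you can't expect LDA to magically work. The same is true for any other model.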

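At inference time the topic-word distribution is fixed (it was learned during the training stage). So get_top_words will always return the same words once the model has been trained.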
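To make "fixed at inference time" concrete, here is a minimal sketch in plain Python. It is not text2vec's API and not the sampler that text2vec uses; it assumes a simplified EM-style "fold-in" where only the document's topic proportions are estimated. The names `infer_doc_topics`, `phi` and `theta` are hypothetical, and the numbers in `phi` are toy values.

```python
def infer_doc_topics(counts, phi, n_iter=50):
    """Estimate topic proportions (theta) for one new document.

    counts: word counts for the document, indexed by vocabulary position.
    phi:    topic-word probabilities learned in training; read, never updated.
    """
    n_topics = len(phi)
    theta = [1.0 / n_topics] * n_topics  # start from a uniform guess
    for _ in range(n_iter):
        new_theta = [0.0] * n_topics
        for w, n_w in enumerate(counts):
            if n_w == 0:
                continue
            # Share of word w attributed to each topic under the current theta.
            weights = [theta[k] * phi[k][w] for k in range(n_topics)]
            norm = sum(weights)
            if norm == 0:
                continue
            for k in range(n_topics):
                new_theta[k] += n_w * weights[k] / norm
        total = sum(new_theta)
        if total == 0:  # empty document: keep the uniform guess
            return theta
        theta = [t / total for t in new_theta]
    return theta


# Toy "trained" model: 2 topics over a 4-word vocabulary (made-up numbers).
phi = [
    [0.50, 0.40, 0.05, 0.05],  # topic 0 prefers words 0 and 1
    [0.05, 0.05, 0.40, 0.50],  # topic 1 prefers words 2 and 3
]
new_doc = [3, 2, 0, 1]  # counts over the same 4-word vocabulary
theta = infer_doc_topics(new_doc, phi)
print(theta)
```

Only `theta` changes from one new document to the next; `phi` is the same before and after, which is why the top words per topic stay the same.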

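And of course new words won't be included automatically: the document-term matrix is built from a vocabulary (which you learn before constructing the DTM), so the DTM for new documents will also contain only words from that fixed vocabulary. Words outside the vocabulary are dropped before the model ever sees them.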
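The effect of a fixed vocabulary can be sketched in a few lines of plain Python. This is a library-independent illustration, not text2vec code; `build_vocabulary` and `vectorize` are hypothetical helpers, and whitespace tokenization is an assumption made for brevity.

```python
def build_vocabulary(docs):
    """Map each distinct training term to a column index."""
    terms = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {term: idx for idx, term in enumerate(terms)}


def vectorize(doc, vocab):
    """Count only terms that exist in the fixed vocabulary.

    Returns one DTM row and the list of tokens that had no column.
    """
    row = [0] * len(vocab)
    dropped = []
    for tok in doc.lower().split():
        if tok in vocab:
            row[vocab[tok]] += 1
        else:
            dropped.append(tok)  # unseen word: nowhere to put it
    return row, dropped


train_docs = ["cats chase mice", "dogs chase cats"]
vocab = build_vocabulary(train_docs)  # learned once, from training data only

# "drones" never appeared in training, so it has no column.
row, dropped = vectorize("cats chase drones", vocab)
print(row, dropped)
```

In text2vec terms, this corresponds to reusing the vectorizer built from the training vocabulary when creating the DTM for new documents, so the columns line up with what the model was trained on.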

answered 2017-11-28T05:25:49.363