scikit-learn - sklearn likelihood from Latent Dirichlet Allocation

Question

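I want to use sklearn's Latent Dirichlet Allocation for anomaly detection. I need to obtain the likelihood of a new sample, as formally described by the equation here.

How can I get that?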

score 1 · Accepted Answer

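Solution to your problem

You should use the model's score() method, which returns the (approximate) log-likelihood of the documents passed in.

Suppose you have created your documents as described in the paper and trained one LDA model per host. You should then take the lowest likelihood over all training documents and use it as the threshold. Untested example code follows: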

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Assuming X contains a host's training documents
# and X_unknown contains the test documents
lda = LatentDirichletAllocation(... parameters here ...)
lda.fit(X)
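# Slice rows as X[i:i + 1] so each document stays 2-D
# (this works for both dense arrays and sparse matrices)
threshold = min(lda.score(X[i:i + 1]) for i in range(X.shape[0]))
attacks = [
    i for i in range(X_unknown.shape[0])
    if lda.score(X_unknown[i:i + 1]) < threshold
]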

# attacks now contains the indexes of the anomalies
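
To make the thresholding idea concrete, here is a minimal end-to-end sketch on toy data. The corpora, the fixed vocabulary, and the choice of n_components=2 are all hypothetical and chosen only for illustration; real host documents would be built as the paper describes.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora standing in for one host's documents.
train_docs = [
    "login ssh login http",
    "http http dns login",
    "dns ssh http login",
    "ssh ssh dns http",
]
unknown_docs = [
    "http login dns ssh",
    "scan scan scan scan exploit exploit",
]

# A fixed vocabulary (an assumption) so that words unseen during
# training still get a column instead of being silently dropped.
vocabulary = ["login", "ssh", "http", "dns", "scan", "exploit"]
vectorizer = CountVectorizer(vocabulary=vocabulary)
X = vectorizer.transform(train_docs)
X_unknown = vectorizer.transform(unknown_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Score each training document on its own; the lowest score is the threshold.
train_scores = [lda.score(X[i:i + 1]) for i in range(X.shape[0])]
threshold = min(train_scores)

# Flag unknown documents that score below the threshold.
attacks = [
    i for i in range(X_unknown.shape[0])
    if lda.score(X_unknown[i:i + 1]) < threshold
]
print(threshold, attacks)
```

Note that score() returns a total over all tokens in the documents passed in, so longer documents tend to score lower. If document lengths vary a lot, dividing each score by the document's token count before comparing may be worth considering.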

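Exactly what you asked

If you want to use the exact equations from the paper you linked, I would advise against trying to do that in scikit-learn, because the expectation-step interface is not clear.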

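The parameters θ and φ can be found in lines 112-130 as doc_topic_d and norm_phi. The function _update_doc_distribution() returns the doc_topic_distribution and the sufficient statistics, from which you could try to infer θ and φ with the following, again untested, code: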

theta = doc_topic_d / doc_topic_d.sum()
# see the variables exp_doc_topic_d in the source code
# in the function _update_doc_distribution()
phi = np.dot(exp_doc_topic_d, exp_topic_word_d) + EPS
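
If relying on private internals is undesirable, a rough alternative is a plug-in estimate built only from the public API: transform() for the per-document topic proportions and components_ for the topic-word parameters. The sketch below is an approximation under that assumption, not necessarily the equation from the linked paper, and the count matrix is hypothetical toy data.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy count matrix: 4 documents over a 5-word vocabulary.
X = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 1, 0, 0],
    [0, 0, 3, 2, 1],
    [0, 1, 2, 3, 0],
])

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# theta: per-document topic proportions, shape (n_docs, n_topics).
theta = lda.transform(X)

# phi: per-topic word distributions, from normalizing components_ row-wise.
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# Plug-in word probabilities: p(w | d) = sum_k theta[d, k] * phi[k, w].
word_probs = theta @ phi

# Per-document log-likelihood under these point estimates.
doc_loglik = (X * np.log(word_probs)).sum(axis=1)
print(doc_loglik)
```

Because this uses point estimates rather than the variational bound, the values will not match score() exactly; they are meant only as a way to inspect θ and φ directly.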

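Suggestion of another library

If you want more control over the expectation and maximization steps and over the variational parameters, I suggest you take a look at LDA++, and specifically at the EStepInterface (disclaimer: I am one of the authors of LDA++).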
