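With an LDA model, these are the two ways I know of to infer topics for a new document using an existing model. What is the difference between them?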
1 Answer
I ran some tests with an ldamodel that has 8 topics. Here are my results, using 2 unseen documents to predict topics for:
list_unseenTw=[['hope', 'miley', 'blow', 'peopl', 'mind', 'tonight', 'gain', 'million', 'fan'],['@mileycyrustour', "we'r", 'think', "it'", 'pretti', 'cool', 'miley', 'saturday', 'night', 'live', 'tonight', '#prettycool']]
Predicting with ldamodel[doc_bow] (the result is already a normalized probability for each topic):
doc_bow = [dictionary.doc2bow(text) for text in list_unseenTw]
predictions = ldamodel[doc_bow]
predictions[0]: [(0, 0.02509002728802024), (1, 0.0250114373070437), (2, 0.025040162139306051), (3, 0.82462688228515812), (4, 0.025150924341817767), (5, 0.025000027675139792), (6, 0.025000024127660267), (7 , 0.025080514835853926)]
predictions[1]: [(0, 0.031250011319462589), (1, 0.031250013721820222), (2, 0.031250019639505598), (3, 0.031250015093378707), (4, 0.031250019670816337), (5, 0.031250024860739675), (6, 0.78124988084026048), (7 , 0.031250014854016454)]
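To read the winning topic off such a result, one option is to take the pair with the largest probability. The sketch below is only an illustration: `dominant_topic` is a hypothetical helper, not part of gensim, and the list is the predictions[0] output above rounded to four decimals.

```python
# Rounded copy of predictions[0] from above: (topic_id, probability) pairs.
predictions_0 = [
    (0, 0.0251), (1, 0.0250), (2, 0.0250), (3, 0.8246),
    (4, 0.0252), (5, 0.0250), (6, 0.0250), (7, 0.0251),
]


def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])


best_topic, best_prob = dominant_topic(predictions_0)
print(best_topic, best_prob)
```

The same helper applied to predictions[1] would pick topic 6.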
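Predicting with ldamodel.inference (the result is a tuple whose first element holds unnormalized topic weights, the variational gamma parameters, rather than probabilities):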
pred=ldamodel.inference(doc_bow)
print(pred)
(array([[0.12545023, 0.1250572, 0.12520085, 4.12309694, 0.12579184, 0.12500014, 0.12500014, 0.12500012, 0.12540268]
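As you can see, once the weights are divided by their sum, the first prediction (doc 1) gives the same result as ldamodel[doc_bow] (topic 3):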
# pred[0] is the gamma array; pred[0][0] is the row for the first document
total = 0
for i in pred[0][0]:
    total += i
4.12309694 / total = 0.82462, i.e. about 82.5%, which matches the 0.8246 that ldamodel[doc_bow] reported for topic 3.
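The manual normalization above can be wrapped in a small function. This is a minimal sketch, not gensim's own code: `gamma_to_topic_distribution` is a hypothetical helper, the gamma row uses made-up round numbers rather than the output from the post, and the `minimum_probability` argument only mimics, as an assumption about the intended behaviour, the way ldamodel[doc_bow] can drop low-probability topics.

```python
def gamma_to_topic_distribution(gamma_row, minimum_probability=0.0):
    """Turn one row of unnormalized topic weights into (topic_id, probability) pairs.

    Each weight is divided by the row sum; topics whose probability falls
    below minimum_probability are left out of the result.
    """
    total = sum(gamma_row)
    pairs = []
    for topic_id, weight in enumerate(gamma_row):
        prob = weight / total
        if prob >= minimum_probability:
            pairs.append((topic_id, prob))
    return pairs


# Illustrative values only (8 topics, one dominant), not taken from the post.
toy_gamma = [0.125, 0.125, 0.125, 4.125, 0.125, 0.125, 0.125, 0.125]

distribution = gamma_to_topic_distribution(toy_gamma)
print(distribution[3])

# With a threshold, only the dominant topic survives.
filtered = gamma_to_topic_distribution(toy_gamma, minimum_probability=0.1)
print(filtered)
```

So the practical difference in this test is normalization: ldamodel[doc_bow] hands back probabilities directly, while the first element returned by ldamodel.inference still has to be divided by its row sum.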