machine-learning - How to understand the output of the Topic Model class in Mallet?

Question

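When I try the example code from the topic modeling developer's guide, I would really like to understand what its output means.

First, while it runs, it prints: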

Coded LDA: 10 topics, 4 topic bits, 1111 topic mask
max tokens: 148
total tokens: 1333
<10> LL/token: -9,24097
<20> LL/token: -9,1026
<30> LL/token: -8,95386
<40> LL/token: -8,75353

0   0,5 battle union confederate tennessee american states 
1   0,5 hawes sunderland echo war paper commonwealth 
2   0,5 test including cricket australian hill career 
3   0,5 average equipartition theorem law energy system 
4   0,5 kentucky army grant gen confederates buell 
5   0,5 years yard national thylacine wilderness parks 
6   0,5 gunnhild norway life extinct gilbert thespis 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings south ring dust 2 uranus 
9   0,5 tasmanian back time sullivan london century 

<50> LL/token: -8,59033
<60> LL/token: -8,63711
<70> LL/token: -8,56168
<80> LL/token: -8,57189
<90> LL/token: -8,46669

0   0,5 battle union confederate tennessee united numerous 
1   0,5 hawes sunderland echo paper commonwealth early 
2   0,5 test cricket south australian hill england 
3   0,5 average equipartition theorem law energy system 
4   0,5 kentucky army grant gen war time 
5   0,5 yard national thylacine years wilderness tasmanian 
6   0,5 including gunnhild norway life time thespis 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings ring dust 2 uranus survived 
9   0,5 back london modern sullivan gilbert needham 

<100> LL/token: -8,49005
<110> LL/token: -8,57995
<120> LL/token: -8,55601
<130> LL/token: -8,50673
<140> LL/token: -8,46388

0   0,5 battle union confederate tennessee war united 
1   0,5 sunderland echo paper edward england world 
2   0,5 test cricket south australian hill record 
3   0,5 average equipartition theorem energy system kinetic 
4   0,5 hawes kentucky army gen grant confederates 
5   0,5 years yard national thylacine wilderness tasmanian 
6   0,5 gunnhild norway including king life devil 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings ring dust 2 uranus number 
9   0,5 london sullivan gilbert thespis back mother 

<150> LL/token: -8,51129
<160> LL/token: -8,50269
<170> LL/token: -8,44308
<180> LL/token: -8,47441
<190> LL/token: -8,62186

0   0,5 battle union confederate grant tennessee numerous 
1   0,5 sunderland echo survived paper edward england 
2   0,5 test cricket south australian hill park 
3   0,5 average equipartition theorem energy system law 
4   0,5 hawes kentucky army gen time confederates 
5   0,5 yard national thylacine years wilderness tasmanian 
6   0,5 gunnhild including norway life king time 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings ring dust 2 uranus number 
9   0,5 back london sullivan gilbert thespis 3 

<200> LL/token: -8,54771

Total time: 6 seconds

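So, question 1: what does "Coded LDA: 10 topics, 4 topic bits, 1111 topic mask" in the first line mean? I only know what "10 topics" refers to.

Question 2: what do "<10> LL/token: -9,24097 <20> LL/token: -9,1026 <30> LL/token: -8,95386 <40> LL/token: -8,75353" mean? It looks like a metric for the Gibbs sampling. But shouldn't it increase monotonically?

After that, the following is printed: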

elizabeth-9 needham-9 died-7 3-9 1731-6 mother-6 needham-9 english-7 procuress-6 brothel-4 keeper-9 18th-8.......
0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4) 
1   0.008   sunderland (6) years (6) echo (5) survived (3) paper (3) 
2   0.040   test (6) cricket (5) hill (4) park (3) career (3) 
3   0.008   average (6) equipartition (6) system (5) theorem (5) law (4) 
4   0.073   hawes (7) kentucky (6) army (5) gen (4) war (4) 
5   0.008   yard (6) national (6) thylacine (5) wilderness (4) tasmanian (4) 
6   0.202   gunnhild (5) norway (4) life (4) including (3) king (3) 
7   0.202   zinta (4) role (3) hindi (3) actress (3) film (3) 
8   0.040   rings (10) ring (3) dust (3) 2 (3) uranus (3) 
9   0.411   london (4) sullivan (3) gilbert (3) thespis (3) back (3) 
0   0.55

The first line of this part is probably the token-topic assignment, right?

Question 3: for the first topic,

0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4)

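0.008 is said to be the "topic distribution"; is it the distribution of this topic over the whole corpus? Then there seems to be a conflict: topic 0, as shown above, has tokens appearing in the corpus 8+7+6+4+4+... times, whereas topic 7 has only 4+3+3+3+3... occurrences recognized in the corpus. By that reasoning, topic 7 should have a lower distribution value than topic 0, yet it is higher. This is what I cannot understand. Furthermore, what is the "0 0.55" at the end?

Thank you very much for reading this long post. I hope you can answer it, and I hope it will be helpful to others interested in Mallet.

Best regards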

score 7 · Accepted Answer

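I don't think I know enough to give a very complete answer, but here is a shot at part of it. For Q1, you can inspect the code to see how those values are calculated. For Q2, LL is the model's log-likelihood divided by the total number of tokens, a measure of how likely the data are given the model. Increasing values mean the model is improving. These values are also available in the R packages for topic modelling. On whether the first line of the later output is the token-topic assignment: yes, I think that's right. For Q3, good question; it isn't immediately clear to me. Perhaps the (x) values are some kind of index, since token frequency seems unlikely. Presumably most of these are diagnostics of some kind.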
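
For Q1 and Q2, a small Python sketch may make the quantities concrete. It is not Mallet's code: the function names are made up for illustration, and the rule used for the mask (take the all-ones bit pattern that covers every topic id, then count its set bits) is an assumption, chosen because it reproduces "10 topics, 4 topic bits, 1111 topic mask". The second function simply follows the description above: total log-likelihood divided by the number of tokens.

```python
import math


def topic_bits_and_mask(num_topics):
    """Return (bits, mask) wide enough to hold any topic id below num_topics.

    Assumed rule, not taken from Mallet's source: if num_topics is an exact
    power of two the mask is num_topics - 1, otherwise it is the all-ones
    bit pattern that reaches up to the highest set bit of num_topics.
    """
    if num_topics & (num_topics - 1) == 0:
        mask = num_topics - 1
    else:
        highest_bit = 1 << (num_topics.bit_length() - 1)
        mask = highest_bit * 2 - 1
    bits = bin(mask).count("1")
    return bits, mask


def ll_per_token(token_probabilities):
    """Average log-likelihood: sum of log p(token) divided by the token count."""
    total = sum(math.log(p) for p in token_probabilities)
    return total / len(token_probabilities)


bits, mask = topic_bits_and_mask(10)
print(bits, format(mask, "b"))  # 4 1111

# Hypothetical per-token probabilities, only to show the arithmetic.
print(ll_per_token([0.001, 0.0002, 0.00005]))
```

Because Gibbs sampling is stochastic, the per-token value can move down between reports even when the overall trend is upward, as in the log in the question.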

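A more useful set of diagnostics can be obtained with bin\mallet run cc.mallet.topics.tui.TopicTrainer ...your various options... --diagnostics-file diagnostics.xml, which produces a large number of topic quality measures. They are definitely worth a look.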

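For the full story on all of this, I suggest emailing David Mimno at Princeton, who is the (main?) maintainer of MALLET, or writing to him via the list at http://blog.gmane.org/gmane.comp.ai.mallet.devel, and then posting the answers back here for those of us who are curious about MALLET's inner workings...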

score 4 · Accepted Answer

My understanding is:

0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4)

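0 is the topic number.
0.008 is the weight of that topic.
battle (8) union (7) [...] are the top keywords in that topic. The numbers are the occurrence counts of the words in the topic.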
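
To work with these lines programmatically, the three parts described above can be pulled apart as in the sketch below. The function name is hypothetical, and the layout (whitespace-separated id and weight, then "word (number)" pairs) is assumed from the sample output in the question rather than from any documented format.

```python
import re


def parse_topic_line(line):
    """Split a line such as '0   0.008   battle (8) union (7)' into its parts."""
    topic_id, weight, rest = line.split(None, 2)
    # Each keyword is followed by its number in parentheses.
    words = [(word, int(count)) for word, count in re.findall(r"(\S+) \((\d+)\)", rest)]
    return int(topic_id), float(weight), words


line = "0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4)"
topic_id, weight, words = parse_topic_line(line)
print(topic_id, weight, words[0])  # 0 0.008 ('battle', 8)
```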

As a result, you also get a .csv file. I think it contains the most important data from the process. Each row holds values like the following:

0   0   285 10   page make items thing work put dec browsers recipes expressions

That is:

Tree level
Topic ID
Total words
Total documents
Top 10 words
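
A row like the one above can be mapped onto those five fields as follows. This is only a sketch: the class and field names are invented here, and it assumes the columns are separated by whitespace, as in the sample row.

```python
from dataclasses import dataclass


@dataclass
class TopicRow:
    tree_level: int
    topic_id: int
    total_words: int
    total_documents: int
    top_words: list


def parse_row(row):
    # The first four columns are integers; everything after them is the word list.
    level, topic_id, n_words, n_docs, words = row.split(None, 4)
    return TopicRow(int(level), int(topic_id), int(n_words), int(n_docs), words.split())


row = "0   0   285 10   page make items thing work put dec browsers recipes expressions"
parsed = parse_row(row)
print(parsed.total_words, parsed.total_documents, len(parsed.top_words))  # 285 10 10
```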

A bit late, but I hope it helps someone.

score 1 · Accepted Answer

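For question 3, I believe the 0.008 (the "topic distribution") relates to the prior \alpha over the documents' topic distributions. Mallet optimizes this prior, essentially allowing some topics to carry more "weight". Mallet seems to be estimating that topic 0 accounts for only a small proportion of your corpus.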
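
One thing that can be checked directly from the numbers in the question, without claiming anything about Mallet's internals: the ten second-column values add up to 1 after rounding, which is how a normalized set of proportions behaves. A minimal sketch of that check:

```python
# Second-column values for topics 0..9, copied from the output in the question.
weights = [0.008, 0.008, 0.040, 0.008, 0.073, 0.008, 0.202, 0.202, 0.040, 0.411]

total = sum(weights)
print(round(total, 3))  # 1.0

# Topics ordered by their value, largest first.
ranked = sorted(enumerate(weights), key=lambda pair: pair[1], reverse=True)
print(ranked[0])  # (9, 0.411)
```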

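The token counts only show the words with the highest counts. The remaining counts for topic 0 could be 0, for example, while the remaining counts for topic 9 could be 3. So topic 9 can account for more tokens in the corpus than topic 0, even though the counts of its top-ranked words are lower.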
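
The arithmetic behind this argument can be illustrated with made-up numbers. In the sketch below the top-five counts are copied from the question, while the tail counts (the words below the top five) are purely hypothetical and chosen to match the example in the paragraph above.

```python
# Top-five counts copied from the question's output.
top_counts = {0: [8, 7, 6, 4, 4], 9: [4, 3, 3, 3, 3]}

# Hypothetical tails (NOT from the output): counts of words below the top five.
tail_counts = {0: [0] * 30, 9: [3] * 30}

for topic in (0, 9):
    top = sum(top_counts[topic])
    total = top + sum(tail_counts[topic])
    print(topic, top, total)
# prints:
# 0 29 29
# 9 16 106
```

With these assumed tails, topic 9 covers far more tokens in total even though each of its top words has a lower count than topic 0's.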

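I would have to check the code for the "0 0.55" at the end, but it may be the optimized \beta value (which I am fairly sure is not optimized asymmetrically).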
