mahout - Training LDA with Mahout and retrieving its topics

Question

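I'm experimenting with Apache Mahout. There is plenty of information on how to generate a topic model with LDA, but very little on how to do the same with the new CVB LDA algorithm. What I want is to generate the probabilities of words given a topic, similar to the original ldatopic.

Any information or examples on how to do this would be greatly appreciated!

Thanks!

Update:

OK, so I've partly solved this, but it's still incomplete, so any help would be great!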

score 4 · Accepted Answer

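OK, so I still don't know how to output the topics, but I have figured out how to run cvb and get what I think are the document vectors. I haven't had any luck dumping them, though, so help here would still be appreciated!

Oh, and don't forget to set the following values: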

export MAHOUT_HOME=/home/sgeadmin/mahout
export HADOOP_HOME=/usr/lib/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export HADOOP_CONF_DIR=$HADOOP_HOME/conf

on the master node, otherwise none of this will work.
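A quick way to confirm that the four variables are actually set in the current shell is a small check like the one below. This is only a sketch: the function name check_mahout_env is made up for illustration and is not part of Mahout or Hadoop.

```shell
# Hypothetical helper (not part of Mahout): report every required
# variable that is unset or empty, and return non-zero if any is missing.
check_mahout_env() {
    missing=0
    for name in MAHOUT_HOME HADOOP_HOME JAVA_HOME HADOOP_CONF_DIR; do
        # Indirectly read the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Example use on the master before running any mahout command:
# check_mahout_env || echo "export the variables listed above first"
```

It only checks that the variables are non-empty; it does not verify that the paths exist.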

So, first upload the documents using starcluster's put (obviously, skip this if you aren't using starcluster :)):

starcluster put mycluster text_train /home/sgeadmin/
starcluster put mycluster text_test /home/sgeadmin/

Then we need to add them to Hadoop's file system, HDFS (don't forget -hadoop starcluster):

dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster

Then call Mahout's seqdirectory to turn the text into sequence files:

$MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow

Then call Mahout's seq2sparse to turn them into vectors:

$MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow

Finally, call cvb. I believe the -dt flag says where the inferred topics should go, but since I haven't been able to dump them yet, I can't confirm this.

$MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 -dict /user/sgeadmin/text_vec/dictionary.file-0 -dt /user/sgeadmin/text_cvb_document -mt /user/sgeadmin/text_states

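The -k flag is the number of topics; the -nt flag is the size of the dictionary, which you can get by counting the entries in dictionary.file-0 (in this case under /user/sgeadmin/text_vec); and -x is the number of iterations.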
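One way to get the value for -nt is to dump the dictionary with Mahout's seqdumper utility and count the entries. The sketch below assumes seqdumper prints one line per entry starting with "Key:", and that the input flag is -i (older releases may use a different flag); the function name count_dictionary_entries is made up for illustration.

```shell
# Hypothetical helper: count the "Key: ..." lines in seqdumper-style
# output read from stdin. The line format is an assumption.
count_dictionary_entries() {
    # grep -c exits non-zero when it finds no match; still print the 0.
    grep -c '^Key:' || true
}

# Example use on the cluster (not run here; check the flag for your version):
# $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_vec/dictionary.file-0 \
#     | count_dictionary_entries
```

The number it prints is what would be passed as -nt in the cvb call above.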

If anyone knows how to get the document-topic probabilities from here, that would be greatly appreciated!

score 2 · Accepted Answer

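The document-topic distributions are stored in sequence file format under the directory you specify with -dt or --doc_topic_output when running mahout cvb. In your case, this directory would be /user/sgeadmin/text_cvb_document.

To dump the contents of these sequence files to a text file, you can use the mahout vectordump utility as follows: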

mahout vectordump -i /path/to/doc_topic_seq_input -o /path/to/doc_topic_text_out -p true -c csv

where:

-i    Path to input directory containing document-topic distribution in sequence file format.
-o    Path to output file that will contain your document-topic distribution in text format.
-p    Key values will be displayed if this parameter is used.
-c    Output the Vector as CSV, otherwise it substitutes in the terms for vector cell entries
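Once the distributions are in a text file, picking the most probable topic for each document takes a few lines of awk. This is a sketch under an assumption about the dump format: each line is the document key, a tab, then the comma-separated topic probabilities. Check your own output before relying on it; the function name dominant_topic is made up for illustration.

```shell
# Hypothetical helper: for each input line, print
# "<doc key><TAB><zero-based index of the most probable topic><TAB><probability>".
# Assumed input format: key, a tab, then comma-separated probabilities.
dominant_topic() {
    awk -F'\t' '
        NF >= 2 {
            n = split($2, p, ",")
            best = 1
            # "+ 0" forces a numeric rather than a string comparison.
            for (i = 2; i <= n; i++) if (p[i] + 0 > p[best] + 0) best = i
            printf "%s\t%d\t%s\n", $1, best - 1, p[best]
        }'
}

# Example use on the dumped file (path as in the vectordump call above):
# dominant_topic < /path/to/doc_topic_text_out
```

Lines that do not contain a tab are skipped, so header or comment lines in the dump will not break it.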

score 2 · Accepted Answer

After completing the above process, you can obtain an output of the computed topics using another Mahout utility, LDAPrintTopics.java, by passing the following options:

  --dict (-d) dict                        Dictionary to read in, in the same format
                                          as one created by
                                          org.apache.mahout.utils.vectors.lucene.Driver
  --output (-o) output                    Output directory to write top words
  --words (-w) words                      Number of words to print
  --input (-i) input                      Path to an LDA output (a state)
  --dictionaryType (-dt) dictionaryType   The dictionary file type
                                          (text|sequencefile)
