mahout - Vectorizing a Solr index with Mahout using lucene.vector

Question

I am trying to run a clustering job on Amazon EMR using Mahout. I have a Solr index uploaded to S3, and I want to vectorize it with Mahout's lucene.vector. (This is the first step in the workflow.)

The parameters for the step are:

Jar: s3n://mahout-bucket/jars/mahout-core-0.6-job.jar
MainClass: org.apache.mahout.driver.MahoutDriver
Args: lucene.vector --dir s3n://mahout-input/solr_index/ --field name --dictOut /test/solr-dict-out/dict.txt --output /test/solr-vectors-out/vectors

The error in the logs is:

Unknown program 'lucene.vector' chosen.

I have run the same process locally with Hadoop and Mahout, and it works fine. How should I invoke lucene.vector on EMR?

score 0 · Accepted Answer

The program name, lucene.vector, should come immediately after bin/mahout:

/homes/cuneyt/trunk/bin/mahout lucene.vector --dir /homes/cuneyt/lucene/index --field 0 --output lda/vector --dictOut /homes/cuneyt/lda/dict.txt

score 0 · Accepted Answer

I finally found the answer. The problem was that I was using the wrong MainClass parameter. Instead of

org.apache.mahout.driver.MahoutDriver

I should have used:

org.apache.mahout.utils.vectors.lucene.Driver

So the correct parameters are:

Jar: s3n://mahout-bucket/jars/mahout-core-0.6-job.jar
MainClass: org.apache.mahout.utils.vectors.lucene.Driver
Args: --dir s3n://mahout-input/solr_index/ --field name --dictOut /test/solr-dict-out/dict.txt --output /test/solr-vectors-out/vectors
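For anyone submitting the step from a command line, here is a minimal sketch of how the corrected parameters above might be assembled into a step definition. It assumes the AWS CLI's `aws emr add-steps` shorthand syntax, which is not what the original post used; the step name `LuceneVector` and the `CLUSTER_ID` variable are hypothetical placeholders. The script only builds and prints the step string, and the actual submission is left commented out.

```shell
#!/bin/sh
# Sketch only: builds a --steps value for `aws emr add-steps`.
# Assumption: the AWS CLI shorthand syntax is used; the original post
# configured the step with Jar / MainClass / Args fields directly.

JAR="s3n://mahout-bucket/jars/mahout-core-0.6-job.jar"
MAIN_CLASS="org.apache.mahout.utils.vectors.lucene.Driver"

# Arguments for the Lucene driver, comma-separated for the shorthand syntax.
# Note there is no leading "lucene.vector" token: the main class is the
# driver itself, so no program name is needed.
STEP_ARGS="--dir,s3n://mahout-input/solr_index/,--field,name,--dictOut,/test/solr-dict-out/dict.txt,--output,/test/solr-vectors-out/vectors"

# Print the full step definition (the name "LuceneVector" is hypothetical).
build_step() {
  printf '%s\n' "Type=CUSTOM_JAR,Name=LuceneVector,ActionOnFailure=CONTINUE,Jar=${JAR},MainClass=${MAIN_CLASS},Args=[${STEP_ARGS}]"
}

STEP=$(build_step)
printf '%s\n' "$STEP"

# To submit for real (CLUSTER_ID is a hypothetical placeholder):
# aws emr add-steps --cluster-id "$CLUSTER_ID" --steps "$STEP"
```

Printing the step before submitting makes it easy to confirm that the main class is the Lucene driver and that the arguments start with `--dir` rather than a program name.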
