0

我正在尝试在 mahout 中的一些 tfidf 向量上运行 ssvd。当我在 Java 代码中运行它时(使用 mahout 0.6 jars),它工作正常:

public static void main(String[] args){
    runSSVDOnSparseVectors(vectorOutputPath
     + "/tfidf-vectors/part-r-00000", ssvdOutputPath, 1, 0, 30000, 1);
}

private static void runSSVDOnSparseVectors(String inputPath, String outputPath, 
                    int rank, int oversampling, int blocks,
                    int reduceTasks) throws IOException {
    Configuration conf = new Configuration();
    // get number of reduce tasks from config?
    SSVDSolver solver = new SSVDSolver(conf, new Path[] { new Path(
            inputPath) }, new Path(outputPath), blocks, rank, oversampling,
            reduceTasks);
    solver.setcUHalfSigma(true);
    solver.setcVHalfSigma(true);
    solver.run();
}

我决定将其转换为 bash 脚本并只使用 cli 命令,但是当我这样做时,我收到以下错误(在 0.5 和 0.7 版本上尝试过,但均无效。我可以尝试 0.6 但我没有不认为这是一个版本的东西):

[username@hostname lsa]$  $MAHOUT/mahout ssvd -i $H/test_lsa/v_out/tfidf-vectors -o $H/test_lsa/svd_out -k 1 -p 0 -r 30000 -t 1
Running on hadoop, using /usr/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/lib/mahout-distribution-0.7/mahout-examples-0.7-job.jar
12/07/23 15:00:47 INFO common.AbstractJob: Command line arguments: {--abtBlockHeight=[200000], --blockHeight=[30000], --broadcast=[true], --computeU=[true], --computeV=[true], --endPhase=[2147483647], --input=[/path/to/folder/test_lsa/v_out/tfidf-vectors], --minSplitSize=[-1], --outerProdBlockHeight=[30000], --output=[/path/to/folder/test_lsa/svd_out], --oversampling=[0], --pca=[false], --powerIter=[0], --rank=[1], --reduceTasks=[100], --startPhase=[0], --tempDir=[temp], --uHalfSigma=[false], --vHalfSigma=[false]}
12/07/23 15:00:49 INFO input.FileInputFormat: Total input paths to process : 100
Exception in thread "main" java.io.IOException: Q job unsuccessful.
    at org.apache.mahout.math.hadoop.stochasticsvd.QJob.run(QJob.java:230)
    at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:377)
    at org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli.run(SSVDCli.java:141)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli.main(SSVDCli.java:171)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:197)

我在集群上以分布式模式运行它。我读过 Q 作业失败可能与块大小有关,但我的大于 p+k。我也意识到我正在使用一个非常小的输入(4 个向量),但就像我说的,它可以在 java 代码中使用。我很困惑为什么它可以在 java 中工作,但不能在 CLI 中工作。我很确定我已经为函数提供了所有相同的参数。我总是可以将 java 代码打包到一个 jar 中并将其放入 bash 脚本中,但这会很 hacky...

该工作的日志说:

2012-07-23 15:00:55,413 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
2012-07-23 15:00:55,417 INFO org.apache.hadoop.mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6ce53220
2012-07-23 15:00:55,638 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2012-07-23 15:00:55,697 ERROR org.apache.mahout.common.IOUtils: new m can't be less than n
java.lang.IllegalArgumentException: new m can't be less than n
    at     org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
    at org.apache.mahout.math.hadoop.stochasticsvd.qr.QRFirstStep.cleanup(QRFirstStep.java:233)
    at org.apache.mahout.math.hadoop.stochasticsvd.qr.QRFirstStep.close(QRFirstStep.java:89)
    at org.apache.mahout.common.IOUtils.close(IOUtils.java:128)
    at org.apache.mahout.math.hadoop.stochasticsvd.QJob$QMapper.cleanup(QJob.java:158)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.    apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
2012-07-23 15:00:55,731 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2012-07-23 15:00:55,733 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.IllegalArgumentException: new m can't be less than n
    at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
    at org.apache.mahout.math.hadoop.stochasticsvd.qr.QRFirstStep.cleanup(QRFirstStep.java:233)
    at org.apache.mahout.math.hadoop.stochasticsvd.qr.QRFirstStep.close(QRFirstStep.java:89)
    at org.apache.mahout.common.IOUtils.close(IOUtils.java:128)
    at org.apache.mahout.math.hadoop.stochasticsvd.QJob$QMapper.cleanup(QJob.java:158)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
2012-07-23 15:00:55,736 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task

我在这里先向您的帮助表示感谢。

4

1 回答 1

0

我实际上认为这是因为 tfidf-vectors 中有一些序列文件是空的,因为我使用了太多的 reducer。这对我来说似乎是一个错误。

于 2012-07-23T21:14:15.927 回答