I am trying to run the Mahout command seq2sparse through the Oozie scheduler, but it fails with errors. I tried running the Mahout command with an Oozie shell action, but nothing worked.
Here is the Oozie workflow:
<action name="mahoutSeq2Sparse">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>mahout seq2sparse</exec>
<argument>-i</argument>
<argument>${nameNode}/tmp/Clustering/seqOutput</argument>
<argument>-o</argument>
<argument>${nameNode}/tmp/Clustering/seqToSparse</argument>
<argument>-ow</argument>
<argument>-nv</argument>
<argument>-x</argument>
<argument>100</argument>
<argument>-n</argument>
<argument>2</argument>
<argument>-wt</argument>
<argument>tf</argument>
<capture-output/>
</shell>
<ok to="brandCanopyInitialCluster" />
<error to="fail" />
</action>
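For reference, a minimal sketch of the same shell action written the way the shell action schema usually expects it, with only the executable in <exec> and the sub-command passed as the first <argument>. This is a sketch only; it assumes the mahout launcher script is on the PATH of whichever NodeManager node actually runs the action, which is not guaranteed:
<shell xmlns="uri:oozie:shell-action:0.1">
  <job-tracker>${jobTracker}</job-tracker>
  <name-node>${nameNode}</name-node>
  <exec>mahout</exec>
  <argument>seq2sparse</argument>
  <argument>-i</argument>
  <argument>${nameNode}/tmp/Clustering/seqOutput</argument>
  <argument>-o</argument>
  <argument>${nameNode}/tmp/Clustering/seqToSparse</argument>
  <argument>-ow</argument>
  <argument>-nv</argument>
  <argument>-x</argument>
  <argument>100</argument>
  <argument>-n</argument>
  <argument>2</argument>
  <argument>-wt</argument>
  <argument>tf</argument>
  <capture-output/>
</shell>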
I also tried creating a shell script and running it through Oozie:
<action name="mahoutSeq2Sparse">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>${EXEC}</exec>
<file>${EXEC}#${EXEC}</file>
</shell>
<ok to="brandCanopyInitialCluster" />
<error to="fail" />
</action>
with job.properties as:
nameNode=hdfs://abc02:8020
jobTracker=http://abc02:8050/
clusteringJobInput=hdfs://abc02:8020/tmp/Activity/000000_0
queueName=default
oozie.wf.application.path=hdfs://abc02:8020/tmp/workflow/
oozie.use.system.libpath=true
EXEC=generatingBrandSparseFile.sh
and generateBrandSparseFile.sh is:
export INPUT_PATH="hdfs://abc02:8020/tmp/Clustering/seqOutput"
export OUTPUT_PATH="hdfs://abc02:8020/tmp/Clustering/seqToSparse"
sudo -u hdfs hadoop fs -chmod -R 777 "hdfs://abc02:8020/tmp/Clustering/seqOutput"
mahout seq2sparse -i ${INPUT_PATH} -o ${OUTPUT_PATH} -ow -nv -x 100 -n 2 -wt tf
sudo -u hdfs hadoop fs -chmod -R 777 ${OUTPUT_PATH}
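As a sketch only, here is a variant of that script without sudo, which is what triggers the no-tty failure when the Oozie launcher runs the script inside a YARN container. The MAHOUT_HOME location is an assumption, and the permissions on the input path would have to be arranged outside the script:
#!/bin/bash
set -e

# Assumed install location of Mahout on the node that executes the shell action.
export MAHOUT_HOME=/usr/lib/mahout
export PATH=$MAHOUT_HOME/bin:$PATH

export INPUT_PATH="hdfs://abc02:8020/tmp/Clustering/seqOutput"
export OUTPUT_PATH="hdfs://abc02:8020/tmp/Clustering/seqToSparse"

# No sudo here: the Oozie shell action has no tty, so sudo prompts fail.
# Permissions on INPUT_PATH are assumed to have been granted beforehand.
mahout seq2sparse -i "${INPUT_PATH}" -o "${OUTPUT_PATH}" -ow -nv -x 100 -n 2 -wt tf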
But none of these options worked. The error from the shell-script attempt is:
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
sudo: no tty present and no askpass program specified
15/06/05 12:23:59 WARN driver.MahoutDriver: No seq2sparse.props found on classpath, will use command-line arguments only
15/06/05 12:24:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
For the sudo: no tty present error, I have already added Defaults !requiretty to /etc/sudoers.
Mahout is installed on the node where the Oozie server is installed.
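Since a shell action is executed by the launcher container on whichever NodeManager it lands on, not necessarily on the Oozie server node, a few diagnostic lines at the top of the script (sketch only) can confirm where it actually runs and whether mahout is visible there:
# Print the executing host and user, and check for the mahout launcher;
# the output appears in the shell action's stdout log.
echo "Running on: $(hostname -f)"
echo "User: $(whoami)"
which mahout || echo "mahout not found on PATH of this node"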
The following Oozie workflow did not work either:
<workflow-app xmlns="uri:oozie:workflow:0.4" name="map-reduce-wf">
<action name="mahoutSeq2Sparse">
<ssh>
<host>rootUserName@abc05.ad.abc.com</host>
<command>mahout seq2sparse</command>
<args>-i</args>
<args>${nameNode}/tmp/Clustering/seqOutput</args>
<args>-o</args>
<args>${nameNode}/tmp/Clustering/seqToSparse</args>
<args>-ow</args>
<args>-nv</args>
<args>-x</args>
<args>100</args>
<args>-n</args>
<args>2</args>
<args>-wt</args>
<args>tf</args>
<capture-output/>
</ssh>
<ok to="brandCanopyInitialCluster" />
<error to="fail" />
</action>
The error is: Error: E0701 : E0701: XML schema error, cvc-complex-type.2.4.a: Invalid content was found starting with element 'ssh'. One of '{"uri:oozie:workflow:0.4":map-reduce, "uri:oozie:workflow:0.4":pig, "uri:oozie:workflow:0.4":sub-workflow, "uri:oozie:workflow:0.4":fs, "uri:oozie:workflow:0.4":java, WC[##other:"uri:oozie:workflow:0.4"]}' is expected.
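The E0701 seems to come from the fact that in the later workflow schemas the ssh action is not part of the uri:oozie:workflow namespace; it is an extension action with its own schema. A sketch of how the action element is usually declared (the remaining body stays as in the listing above; whether passwordless ssh from the Oozie server to that host is set up is a separate assumption):
<action name="mahoutSeq2Sparse">
  <ssh xmlns="uri:oozie:ssh-action:0.1">
    <host>rootUserName@abc05.ad.abc.com</host>
    <command>mahout</command>
    <args>seq2sparse</args>
    <args>-i</args>
    <!-- remaining <args> elements unchanged from the listing above -->
    <capture-output/>
  </ssh>
  <ok to="brandCanopyInitialCluster" />
  <error to="fail" />
</action>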
Would installing Mahout on all nodes help? (Oozie can run the script on any node.) Is there a way to make Mahout available across the Hadoop cluster?
Any other solution is also welcome.
Thanks in advance.
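On making Mahout available to the cluster: rather than installing it on every node, one common pattern is to ship the needed jars with the workflow by placing them in the application's lib/ directory on HDFS, so Oozie can add them to the action's classpath instead of relying on a local install. A sketch only; the local jar path and name are assumptions that depend on the installed Mahout version, and the HDFS path comes from oozie.wf.application.path above:
# Create the lib/ directory under the workflow application path and upload the Mahout job jar.
hadoop fs -mkdir -p /tmp/workflow/lib
hadoop fs -put /usr/lib/mahout/mahout-examples-*-job.jar /tmp/workflow/lib/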
EDIT: I changed the approach a bit and am now calling the seq2sparse class directly. The workflow is:
<action name="mahoutSeq2Sparse">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<main-class>org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles</main-class>
<arg>-i</arg>
<arg>${nameNode}/tmp/OozieData/Clustering/seqOutput</arg>
<arg>-o</arg>
<arg>${nameNode}/tmp/OozieData/Clustering/seqToSparse</arg>
<arg>-ow</arg>
<arg>-nv</arg>
<arg>-x</arg>
<arg>100</arg>
<arg>-n</arg>
<arg>2</arg>
<arg>-wt</arg>
<arg>tf</arg>
</java>
<ok to="CanopyInitialCluster"/>
<error to="fail"/>
</action>
The job still does not run, and the error is:
>>> Invoking Main class now >>>
Main class : org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles
Arguments :
-i
hdfs://abc:8020/tmp/OozieData/Clustering/seqOutput
-o
hdfs://abc:8020/tmp/OozieData/Clustering/seqToSparse
-ow
-nv
-x
100
-n
2
-wt
tf
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.JavaMain], main() threw exception, java.lang.IllegalStateException: Job failed!
org.apache.oozie.action.hadoop.JavaMainException: java.lang.IllegalStateException: Job failed!
at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:58)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:39)
at org.apache.oozie.action.hadoop.JavaMain.main(JavaMain.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:226)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.IllegalStateException: Job failed!
at org.apache.mahout.vectorizer.DictionaryVectorizer.startWordCounting(DictionaryVectorizer.java:368)
at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:179)
at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:288)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:56)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:55)
... 15 more
Oozie Launcher failed, finishing Hadoop job gracefully
Oozie Launcher, uploading action data to HDFS sequence file: hdfs://vchniecnveg02:8020/user/root/oozie-oozi/0000054-150604142118313-oozie-oozi-W/mahoutSeq2Sparse--java/action-data.seq
Oozie Launcher ends
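The stack trace only shows that one of the MapReduce jobs launched by SparseVectorsFromSequenceFiles (the word-counting step in DictionaryVectorizer) failed; the actual cause will be in the logs of that child job rather than in the launcher output above. A sketch of pulling those logs, assuming the failed child application id is taken from the ResourceManager UI or the launcher syslog (the id below is a placeholder):
# Placeholder application id; substitute the id of the failed child MapReduce job.
yarn logs -applicationId application_1433XXXXXXXXX_XXXX | less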