I am running the following Rscript, gdp.R:
#!/usr/bin/env Rscript
Sys.getenv(c("HADOOP_HOME", "HADOOP_CMD", "HADOOP_STREAMING", "HADOOP_CONF_DIR"))
library(rmr2)
library(rhdfs)
setwd("/root/somnath/GDP_data/")
gdp <- read.csv("GDP.csv")
head(gdp)
hdfs.init()
gdp.values <- to.dfs(gdp)
aaplRevenue = 156508
gdp.map.fn <- function(k, v) {
  key <- ifelse(v[4] < aaplRevenue, "less", "greater")
  keyval(key, 1)
}
count.reduce.fn <- function(k, v) {
  keyval(k, length(v))
}
count <- mapreduce(input=gdp.values, map=gdp.map.fn, reduce=count.reduce.fn)
from.dfs(count)
and I cannot get past the following error thrown by the mapreduce function:
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
  hadoop streaming failed with error code 1
Calls: mapreduce -> mr
[root@kkws029 RHadoop_scripts]# Rscript gdp.R
Loading required package: methods
Loading required package: rJava
HADOOP_CMD=/opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop
Be sure to run hdfs.init()
CountryCode Number CountryName GDP
1 USA 1 UnitedStates 168000
2 CHN 2 China 9240270
3 JPN 3 Japan 4901530
4 DEU 4 Germany 3634823
5 FRA 5 France 2734949
6 GBR 6 UnitedKingdom 2522261
14/07/08 16:57:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 24: /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-hdfs/bin/../libexec/hdfs-config.sh: No such file or directory
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 140: cygpath: command not found
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 172: exec: : not found
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 24: /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-hdfs/bin/../libexec/hdfs-config.sh: No such file or directory
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 140: cygpath: command not found
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 172: exec: : not found
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 24: /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-hdfs/bin/../libexec/hdfs-config.sh: No such file or directory
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 140: cygpath: command not found
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 172: exec: : not found
14/07/08 16:57:02 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
14/07/08 16:57:02 INFO compress.CodecPool: Got brand-new compressor [.deflate]
Warning messages:
1: In rmr.options("backend") :
Please set an HDFS temp directory with rmr.options(hdfs.tempdir = ...)
2: In rmr.options("hdfs.tempdir") :
Please set an HDFS temp directory with rmr.options(hdfs.tempdir = ...)
3: In rmr.options("backend") :
Please set an HDFS temp directory with rmr.options(hdfs.tempdir = ...)
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 24: /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-hdfs/bin/../libexec/hdfs-config.sh: No such file or directory
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 140: cygpath: command not found
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 172: exec: : not found
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 24: /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-hdfs/bin/../libexec/hdfs-config.sh: No such file or directory
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 140: cygpath: command not found
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 172: exec: : not found
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 24: /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-hdfs/bin/../libexec/hdfs-config.sh: No such file or directory
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 140: cygpath: command not found
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 172: exec: : not found
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 24: /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-hdfs/bin/../libexec/hdfs-config.sh: No such file or directory
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 140: cygpath: command not found
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 172: exec: : not found
packageJobJar: [/tmp/hadoop-root/hadoop-unjar4163972761639211537/] [] /tmp/streamjob5583656452134995821.jar tmpDir=null
14/07/08 16:57:04 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/07/08 16:57:13 INFO mapred.FileInputFormat: Total input paths to process : 1
14/07/08 16:57:20 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
14/07/08 16:57:20 INFO streaming.StreamJob: Running job: job_201407071547_0024
14/07/08 16:57:20 INFO streaming.StreamJob: To kill this job, run:
14/07/08 16:57:20 INFO streaming.StreamJob: /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=kkws030.mara-ison.com:8021 -kill job_201407071547_0024
14/07/08 16:57:20 INFO streaming.StreamJob: Tracking URL: http://kkws030.mara-ison.com:50030/jobdetails.jsp?jobid=job_201407071547_0024
14/07/08 16:57:21 INFO streaming.StreamJob: map 0% reduce 0%
14/07/08 16:57:26 INFO streaming.StreamJob: map 50% reduce 0%
14/07/08 16:57:49 INFO streaming.StreamJob: map 100% reduce 100%
14/07/08 16:57:49 INFO streaming.StreamJob: To kill this job, run:
14/07/08 16:57:49 INFO streaming.StreamJob: /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=kkws030.mara-ison.com:8021 -kill job_201407071547_0024
14/07/08 16:57:49 INFO streaming.StreamJob: Tracking URL: http://kkws030.mara-ison.com:50030/jobdetails.jsp?jobid=job_201407071547_0024
14/07/08 16:57:49 ERROR streaming.StreamJob: Job not successful. Error: NA
14/07/08 16:57:49 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
Calls: mapreduce -> mr
In addition: Warning messages:
1: In rmr.options("backend") :
Please set an HDFS temp directory with rmr.options(hdfs.tempdir = ...)
2: In rmr.options("hdfs.tempdir") :
Please set an HDFS temp directory with rmr.options(hdfs.tempdir = ...)
3: In rmr.options("backend") :
Please set an HDFS temp directory with rmr.options(hdfs.tempdir = ...)
4: In rmr.options("backend.parameters") :
Please set an HDFS temp directory with rmr.options(hdfs.tempdir = ...)
Execution halted
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 24: /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-hdfs/bin/../libexec/hdfs-config.sh: No such file or directory
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 140: cygpath: command not found
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 172: exec: : not found
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 24: /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-hdfs/bin/../libexec/hdfs-config.sh: No such file or directory
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 140: cygpath: command not found
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 172: exec: : not found
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 24: /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-hdfs/bin/../libexec/hdfs-config.sh: No such file or directory
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 140: cygpath: command not found
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 172: exec: : not found
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 24: /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-hdfs/bin/../libexec/hdfs-config.sh: No such file or directory
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 140: cygpath: command not found
/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop/libexec/../../hadoop-hdfs/bin/hdfs: line 172: exec: : not found
Warning messages:
1: In rmr.options("backend") :
Please set an HDFS temp directory with rmr.options(hdfs.tempdir = ...)
2: In rmr.options("backend") :
Please set an HDFS temp directory with rmr.options(hdfs.tempdir = ...)
My stderr log is as follows:
Loading objects:
.Random.seed
aaplRevenue
count.reduce.fn
gdp
gdp.map.fn
gdp.values
Loading objects:
backend.parameters
combine
combine.file
combine.line
debug
default.input.format
default.output.format
in.folder
in.memory.combine
input.format
libs
map
map.file
map.line
out.folder
output.format
pkg.opts
postamble
preamble
profile.nodes
reduce
reduce.file
reduce.line
rmr.global.env
rmr.local.env
save.env
tempfile
vectorized.reduce
verbose
work.dir
Loading required package: rhdfs
Loading required package: methods
Loading required package: rJava
Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
call: fun(libname, pkgname)
error: Environment variable HADOOP_CMD must be set before loading package rhdfs
Warning in FUN(c("rhdfs", "rJava", "methods", "rmr2", "stats", "graphics", :
can't load rhdfs
Loading required package: rmr2
Loading objects:
backend.parameters
combine
combine.file
combine.line
debug
default.input.format
default.output.format
in.folder
in.memory.combine
input.format
libs
map
map.file
map.line
out.folder
output.format
pkg.opts
postamble
preamble
profile.nodes
reduce
reduce.file
reduce.line
rmr.global.env
rmr.local.env
save.env
tempfile
vectorized.reduce
verbose
work.dir
Warning in Ops.factor(left, right) : < not meaningful for factors
Error in split.default(a.list, ceiling(seq_along(a.list)/every.so.many), :
first argument must be a vector
Calls: <Anonymous> ... <Anonymous> -> do.call -> mapply -> split -> split.default
No traceback available
Error during wrapup:
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Any suggestions would be greatly appreciated.
Thanks, S
Note: I have quoted the error from the stderr log above; it states that the environment variable HADOOP_CMD must be set before loading the rhdfs package. Is there a way to export the HADOOP system environment variables into R? Also note that I call Sys.getenv(c("HADOOP_HOME", ...)) at the beginning of the script, but that does not appear to have any effect, judging by the stderr log.
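One workaround I am considering is setting the variables from inside R before loading rhdfs, roughly like the sketch below (the paths are copied from my ~/.bash_profile shown further down; I am not sure whether this is the right approach, or whether it propagates to the streaming tasks):
# Untested sketch: export the HADOOP variables inside R before rhdfs is loaded
Sys.setenv(HADOOP_CMD = "/opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar")
Sys.setenv(HADOOP_CONF_DIR = "/opt/cloudera/parcels/CDH/lib/hadoop/etc/hadoop")
library(rhdfs)
hdfs.init()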
Note that I have already added the following export commands for the HADOOP environment variables to my ~/.bash_profile:
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi
# User specific environment and startup programs
export JRI_PATH=/usr/lib64/R
export R_HOME=/usr/lib64/R
export R_SHARE_DIR=/usr/share/R
#export JRI_LD_PATH=${R_HOME}/library/rJava/jri:${R_HOME}/lib:${R_HOME}/bin
export LD_LIBRARY_PATH=${R_HOME}/library/rJava/jri
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export HADOOP_CMD=${HADOOP_HOME}/bin/hadoop
export HADOOP_STREAMING=/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export JAVA_HOME=/var/jdk1.7.0_25
#PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
PATH=$PATH:$R_HOME/bin:$JAVA_HOME/bin:$LD_LIBRARY_PATH:/opt/cloudera/parcels/CDH/lib/mahout:/opt/cloudera/parcels/CDH/lib/hadoop:/opt/cloudera/parcels/CDH/lib/hadoop-hdfs:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce:/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce:/var/lib/storm-0.9.0-rc2/lib:$HADOOP_CMD:$HADOOP_STREAMING:$HADOOP_CONF_DIR
export PATH
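The output also shows repeated warnings asking me to set an HDFS temp directory. I have not done that yet; if it is relevant, I assume it would look roughly like the line below (the path /tmp/rmr2 is only a guess on my part):
rmr.options(hdfs.tempdir = "/tmp/rmr2")  # hypothetical temp path; unsure whether this relates to the failure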