
I am trying to set up a pseudo-distributed Hadoop 2.6 cluster for running Giraph jobs. Since I couldn't find a comprehensive guide for that, I have been relying on the Giraph QuickStart (http://giraph.apache.org/quick_start.html), which unfortunately targets Hadoop 0.20.203.0, plus a couple of Hadoop 2.6/YARN tutorials. To get things right, I put together a bash script that should install Hadoop and Giraph. Unfortunately, Giraph jobs keep failing with a "Worker failed during input split" exception. I would be grateful if someone could point out the mistake in my deployment procedure, or suggest another setup that is known to work.

Edit: My main goal is to be able to develop Giraph 1.1 jobs. I don't need to run any heavy computation myself (eventually the jobs will run on an external cluster), so if there is any simpler way to get a Giraph development environment, that would be fine too.

The installation script follows:

#! /bin/bash
set -exu

echo "Starting hadoop + giraph installation; JAVA HOME is $JAVA_HOME"

INSTALL_DIR=~/apache_hadoop_giraph


mkdir -p $INSTALL_DIR/downloads

############# PHASE 1: YARN ##############

#### 1: Get and unpack Hadoop:

if [ ! -f $INSTALL_DIR/downloads/hadoop-2.6.0.tar.gz ]; then
  wget -P $INSTALL_DIR/downloads ftp://ftp.task.gda.pl/pub/www/apache/dist/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
fi
tar -xf $INSTALL_DIR/downloads/hadoop-2.6.0.tar.gz -C $INSTALL_DIR

export HADOOP_PREFIX=$INSTALL_DIR/hadoop-2.6.0
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop


#### 2: Configure Hadoop and YARN

sed -i -e "s|^export JAVA_HOME=\${JAVA_HOME}|export JAVA_HOME=$JAVA_HOME|g" ${HADOOP_PREFIX}/etc/hadoop/hadoop-env.sh

cat <<EOF > ${HADOOP_PREFIX}/etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
EOF

cat <<EOF > ${HADOOP_PREFIX}/etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
EOF

cat <<EOF > ${HADOOP_PREFIX}/etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
EOF

cat <<EOF > ${HADOOP_PREFIX}/etc/hadoop/yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
EOF

#### 3: Prepare HDFS:

cd $HADOOP_PREFIX
export HDFS=$HADOOP_PREFIX/bin/hdfs

sbin/stop-all.sh # Just to be sure no daemons are left running

# The following line is commented out in case some SO readers have something important in /tmp:
# rm -rf /tmp/* || echo "removal of some parts of tmp failed"

$HDFS namenode -format
sbin/start-dfs.sh


#### 4: Create HDFS directories:
$HDFS dfs -mkdir -p /user
$HDFS dfs -mkdir -p /user/`whoami`



#### 5 (optional): Run a test job

sbin/start-yarn.sh
$HDFS dfs -put etc/hadoop input
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
$HDFS dfs -cat output/*   # Prints some lines grep'd out of the input files
sbin/stop-yarn.sh

#### 6: Stop HDFS for now
sbin/stop-dfs.sh





############# PHASE 2: Giraph ##############

#### 1: Get Giraph 1.1

export GIRAPH_HOME=$INSTALL_DIR/giraph
cd $INSTALL_DIR
git clone http://git-wip-us.apache.org/repos/asf/giraph.git giraph
cd $GIRAPH_HOME
git checkout release-1.1

#### 2: Build 

mvn -Phadoop_2 -Dhadoop.version=2.6.0 -DskipTests package 


#### 3: Run a test job:

# Remove leftovers if any:
$HADOOP_HOME/sbin/start-dfs.sh
$HDFS dfs -rm -r -f /user/`whoami`/output
$HDFS dfs -rm -r -f /user/`whoami`/input/tiny_graph.txt
$HDFS dfs -mkdir -p /user/`whoami`/input

# Place input:
$HDFS dfs -put tiny_graph.txt input/tiny_graph.txt

# Start YARN
$HADOOP_HOME/sbin/start-yarn.sh

# Run the job (this fails with 'Worker failed during input split'):
JAR=$GIRAPH_HOME/giraph-examples/target/giraph-examples-1.1.0-for-hadoop-2.6.0-jar-with-dependencies.jar
CORE=$GIRAPH_HOME/giraph-core/target/giraph-1.1.0-for-hadoop-2.6.0-jar-with-dependencies.jar
$HADOOP_HOME/bin/hadoop jar $JAR \
         org.apache.giraph.GiraphRunner \
         org.apache.giraph.examples.SimpleShortestPathsComputation \
         -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
         -vip /user/ptaku/input/tiny_graph.txt \
         -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
         -op /user/ptaku/output/shortestpaths \
         -yj $JAR,$CORE \
         -w 1 \
         -ca giraph.SplitMasterWorker=false

The script runs smoothly up to the last command, which hangs for a long time in the map 100% reduce 0% state; inspecting the log files of the YARN containers reveals a cryptic java.lang.IllegalStateException: coordinateVertexInputSplits: Worker failed during input split (currently not supported). Full container logs are on pastebin:

Container 1 (master): http://pastebin.com/6nYvtNxJ

Container 2 (worker): http://pastebin.com/3a6CQamQ
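For completeness, this is roughly how container logs can be reached in a pseudo-distributed setup like the one above (the application id below is a placeholder, not one from my run; take the real id from the job's console output or from `bin/yarn application -list`):

```shell
# Option 1: the yarn CLI, if log aggregation is enabled
bin/yarn logs -applicationId application_1425000000000_0001

# Option 2: without log aggregation, per-container logs stay on the
# local filesystem under the NodeManager's log directory:
ls $HADOOP_HOME/logs/userlogs/application_1425000000000_0001/
```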

I have also tried building Giraph with the hadoop_yarn profile (after removing STATIC_SASL_SYMBOL from pom.xml), but it didn't change anything.

I am running 64-bit Ubuntu 14.10 with 4 GB of RAM and 16 GB of swap. Additional system info:

>> uname -a
Linux Graffi 3.13.0-35-generic #62-Ubuntu SMP Fri Aug 15 01:58:42 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>> which java
/usr/bin/java
>> java -version
java version "1.7.0_75"
OpenJDK Runtime Environment (IcedTea 2.5.4) (7u75-2.5.4-1~trusty1)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
>> echo $JAVA_HOME
/usr/lib/jvm/java-7-openjdk-amd64/jre
>> which mvn
/usr/bin/mvn
>> mvn --version
Apache Maven 3.0.5
Maven home: /usr/share/maven
Java version: 1.7.0_75, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-7-openjdk-amd64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.13.0-35-generic", arch: "amd64", family: "unix"

I would appreciate any help on getting Giraph 1.1 to run on Hadoop 2.6.


1 Answer


I ran into a similar problem a while ago. The issue was that my machine's hostname contained uppercase letters, which is a known bug (https://issues.apache.org/jira/browse/GIRAPH-904). Changing the hostname to lowercase letters only fixed it for me.
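As a quick sanity check, a minimal sketch (the helper function here is mine, not part of Giraph) that flags a hostname that would trip GIRAPH-904:

```shell
# Hypothetical helper: flag hostnames that hit GIRAPH-904
# (uppercase letters in the hostname break Giraph worker coordination).
check_hostname() {
  case "$1" in
    *[A-Z]*) echo "bad" ;;  # contains uppercase letters: rename the host
    *)       echo "ok"  ;;
  esac
}

check_hostname "$(hostname)"
```

Notably, the uname output in the question shows the host name Graffi, which contains a capital G and would be flagged by this check.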

Answered 2015-03-09T13:41:13.677