The Python version that ships with BigInsights is currently 2.6.6. How can you use a different version of Python with Spark jobs running on YARN?
Note that BigInsights on cloud users do not have root access.
Install Anaconda
This script installs Anaconda Python on a BigInsights on cloud 4.2 Enterprise cluster. Note that these instructions do not work on Basic clusters, because there you can only log in to the shell node and not to any of the other nodes.
SSH into the mastermanager node, then run the following (changing the values for your environment):
export BI_USER=snowch
export BI_PASS=changeme
export BI_HOST=bi-hadoop-prod-4118.bi.services.us-south.bluemix.net
Next, run the following. The script tries to be as idempotent as possible, so it doesn't matter if you run it more than once:
# abort if the script encounters an error, an undeclared variable, or a failed pipeline
set -euo pipefail
CLUSTER_NAME=$(curl -s -k -u $BI_USER:$BI_PASS -X GET https://${BI_HOST}:9443/api/v1/clusters | python -c 'import sys, json; print(json.load(sys.stdin)["items"][0]["Clusters"]["cluster_name"]);')
echo Cluster Name: $CLUSTER_NAME
CLUSTER_HOSTS=$(curl -s -k -u $BI_USER:$BI_PASS -X GET https://${BI_HOST}:9443/api/v1/clusters/${CLUSTER_NAME}/hosts | python -c 'import sys, json; items = json.load(sys.stdin)["items"]; hosts = [ item["Hosts"]["host_name"] for item in items ]; print(" ".join(hosts));')
echo Cluster Hosts: $CLUSTER_HOSTS
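The two one-liners above pull the cluster name and host list out of the JSON returned by Ambari's REST API. The same parsing as a standalone sketch, using made-up payloads shaped like the `/api/v1/clusters` and `/api/v1/clusters/<name>/hosts` responses (the host and cluster names here are illustrative, not real):

```python
import json

# Sample payloads mimicking the Ambari REST API response shapes (values are made up)
clusters_json = '{"items": [{"Clusters": {"cluster_name": "AnalyticsCluster"}}]}'
hosts_json = ('{"items": ['
              '{"Hosts": {"host_name": "node1.example.com"}},'
              '{"Hosts": {"host_name": "node2.example.com"}}]}')

# Same expressions as the inline `python -c` snippets above
cluster_name = json.loads(clusters_json)["items"][0]["Clusters"]["cluster_name"]
hosts = [item["Hosts"]["host_name"] for item in json.loads(hosts_json)["items"]]

print(cluster_name)     # AnalyticsCluster
print(" ".join(hosts))  # node1.example.com node2.example.com
```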
wget -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh
# Install anaconda if it isn't already installed
[[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b
# You can install your pip modules using something like this:
# ${HOME}/anaconda2/bin/python -c 'import yourlibrary' || ${HOME}/anaconda2/bin/pip install yourlibrary
# Install anaconda on all of the cluster nodes
for CLUSTER_HOST in ${CLUSTER_HOSTS};
do
   if [[ "$CLUSTER_HOST" != "$BI_HOST" ]];
   then
      echo "*** Processing $CLUSTER_HOST ***"
      ssh $BI_USER@$CLUSTER_HOST "wget -q -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh"
      ssh $BI_USER@$CLUSTER_HOST "[[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b"
      # You can install your pip modules on each node using something like this:
      # ssh $BI_USER@$CLUSTER_HOST "${HOME}/anaconda2/bin/python -c 'import yourlibrary' || ${HOME}/anaconda2/bin/pip install yourlibrary"
      # Set the PYSPARK_PYTHON path on all of the nodes
      ssh $BI_USER@$CLUSTER_HOST "grep '^export PYSPARK_PYTHON=' ~/.bash_profile || echo export PYSPARK_PYTHON=${HOME}/anaconda2/bin/python2.7 >> ~/.bash_profile"
      ssh $BI_USER@$CLUSTER_HOST "sed -i -e 's;^export PYSPARK_PYTHON=.*$;export PYSPARK_PYTHON=${HOME}/anaconda2/bin/python2.7;g' ~/.bash_profile"
      ssh $BI_USER@$CLUSTER_HOST "cat ~/.bash_profile"
   fi
done
echo 'Finished installing'
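The `[[ -d anaconda2 ]] || ...` guard is what makes the script safe to re-run: the installer only fires when the target directory is missing. The same pattern in isolation, using a scratch directory (`/tmp/demo_anaconda2`, a stand-in for `${HOME}/anaconda2`) so it can be tried without touching a real install:

```shell
# Idempotent-install pattern: run the install step only when the target
# directory is missing, so re-running the whole script is harmless.
TARGET=/tmp/demo_anaconda2          # stand-in for ${HOME}/anaconda2
install_step() { mkdir -p "$TARGET"; echo "installed"; }

[ -d "$TARGET" ] || install_step    # first run: prints "installed"
[ -d "$TARGET" ] || install_step    # second run: guard short-circuits, no output
rm -rf "$TARGET"                    # clean up the demo directory
```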
Running a pyspark job
If you are using pyspark, you can use the Anaconda Python by setting the following variables before running your pyspark commands:
export SPARK_HOME=/usr/iop/current/spark-client
export HADOOP_CONF_DIR=/usr/iop/current/hadoop-client/conf
# set these to the folders where you installed anaconda
export PYSPARK_PYTHON=/home/biadmin/anaconda2/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/home/biadmin/anaconda2/bin/python2.7
spark-submit --master yarn --deploy-mode client ...
# NOTE: --deploy-mode cluster does not seem to use the PYSPARK_PYTHON setting
...
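A quick way to confirm which interpreter Spark actually picked up is to print the version from inside the job. A minimal sketch; the `sc.parallelize` line in the comment assumes a live SparkContext and is not run here:

```python
import sys

def interpreter_report():
    """Return the (major, minor) version of the interpreter running this code."""
    return sys.version_info[0], sys.version_info[1]

# Called on the driver, this reflects PYSPARK_DRIVER_PYTHON. Shipped inside a
# task, it reflects PYSPARK_PYTHON on each executor, e.g. (assuming a live
# SparkContext `sc`):
#   sc.parallelize(range(2)).map(lambda _: interpreter_report()).collect()
print(interpreter_report())
```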
Zeppelin (optional)
If you are using Zeppelin (following these instructions for BigInsights on cloud), set the following variables in zeppelin-env.sh:
# set these to the folders where you installed anaconda
export PYSPARK_PYTHON=/home/biadmin/anaconda2/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/home/biadmin/anaconda2/bin/python2.7