hadoop - Spark/YARN - not all nodes are used in spark-submit

Question

I have a Spark/YARN cluster with 3 slaves set up on AWS.

I spark-submit a job like this:

~/spark-2.1.1-bin-hadoop2.7/bin/spark-submit --master yarn --deploy-mode cluster my.py

The final result should be a file containing the hostnames of all the slaves in the cluster. I was expecting a mix of hostnames in the output file, but I only see one hostname. That suggests YARN never utilizes the other slaves in the cluster.

Am I missing something in the configuration?

I have also included my spark-env.sh settings below.

HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/
YARN_CONF_DIR=/usr/local/hadoop/etc/hadoop/

SPARK_EXECUTOR_INSTANCES=3
SPARK_WORKER_CORES=3

my.py

import socket
import time
from pyspark import SparkContext, SparkConf

def get_ip_wrap(num):
    return socket.gethostname()

conf = SparkConf().setAppName('appName')
sc = SparkContext(conf=conf)

data = [x for x in range(1, 100)]
distData = sc.parallelize(data)

result = distData.map(get_ip_wrap)
result.saveAsTextFile('hby%s'% str(time.time()))
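One way to check how the records were spread across hosts is to tally the hostnames in the saved part files. This is only a minimal sketch: the helper name count_hostnames is hypothetical, and it assumes the output directory has already been copied to the local filesystem (in cluster mode on YARN the job would normally write to HDFS).

```python
import glob
import os
from collections import Counter


def count_hostnames(output_dir):
    """Tally hostnames found in the part-* files of a saveAsTextFile output.

    Hypothetical helper: assumes output_dir is a local copy of the directory
    written by the job, with one hostname per line in each part file.
    """
    counts = Counter()
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path) as part_file:
            for line in part_file:
                host = line.strip()
                if host:  # skip blank lines
                    counts[host] += 1
    return counts
```

If only one key appears in the returned counter, every record was processed on the same host.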

score 0 · Accepted Answer

After I updated the following settings in spark-env.sh, all the slaves were used.

SPARK_EXECUTOR_INSTANCES=3
SPARK_EXECUTOR_CORES=8
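The same request can probably also be made per job on the command line, since spark-submit accepts --num-executors and --executor-cores when running on YARN. The sketch below only assembles such a command; the helper name build_submit_command is hypothetical, the values 3 and 8 simply mirror the settings above, and whether this behaves identically to the spark-env.sh change was not verified in this thread.

```python
import shlex


def build_submit_command(script, num_executors=3, executor_cores=8,
                         spark_submit="spark-submit"):
    """Build a spark-submit argument list that requests executors explicitly.

    Hypothetical helper: the defaults mirror the spark-env.sh values above.
    """
    return [
        spark_submit,
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--num-executors", str(num_executors),    # executors requested from YARN
        "--executor-cores", str(executor_cores),  # cores per executor
        script,
    ]


if __name__ == "__main__":
    # Print the command as a shell-safe string instead of running it.
    print(" ".join(shlex.quote(arg) for arg in build_submit_command("my.py")))
```

The list can be passed to subprocess.run on a machine where spark-submit is on the PATH; YARN can only grant the requested cores if the node managers advertise that many.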
