伙计们,
我正在运行 pyspark 代码以从 hdfs 读取 500mb 文件并从文件内容构造一个 numpy 矩阵
集群信息:
9 个数据节点 128 GB 内存 /48 vCore CPU /节点
作业配置
conf = SparkConf().setAppName('test') \
.set('spark.executor.cores', 4) \
.set('spark.executor.memory', '72g') \
.set('spark.driver.memory', '16g') \
.set('spark.yarn.executor.memoryOverhead',4096 ) \
.set('spark.dynamicAllocation.enabled', 'true') \
.set('spark.shuffle.service.enabled', 'true') \
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.set('spark.driver.maxResultSize',10000) \
.set('spark.kryoserializer.buffer.max', 2044)
fileRDD=sc.textFile("/tmp/test_file.txt")
fileRDD.cache
list_of_lines_from_file = fileRDD.map(lambda line: line.split(" ")).collect()
错误
收集件正在吐出内存不足错误。
18/05/17 19:03:15 ERROR client.TransportResponseHandler: Still have 1
requests outstanding when connection fromHost/IP:53023 is closed
18/05/17 19:03:15 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.lang.OutOfMemoryError: Java heap space
任何帮助深表感谢。