Here is what I am doing on a single-node local Spark cluster:
git clone https://github.com/deeplearning4j/dl4j-spark-cdh5-examples.git
cd dl4j-spark-cdh5-examples
mvn package
export SPARK_WORKER_MEMORY=13g
spark-submit --class org.deeplearning4j.examples.cnn.MnistExample ./target/dl4j-spark-cdh5-examples-1.0-SNAPSHOT.jar
This is what I get:
Caused by: java.lang.OutOfMemoryError: Java heap space
Here is the full stack trace:
spark-submit --class org.deeplearning4j.examples.cnn.MnistExample ./target/dl4j-spark-cdh5-examples-1.0-SNAPSHOT.jar
21:21:13,414 INFO ~ Load data....
Warning: COULD NOT LOAD NATIVE SYSTEM BLAS. ND4J performance will be reduced. Please install native BLAS library such as OpenBLAS or IntelMKL. See http://nd4j.org/getstarted.html#open for details.
21:21:20,571 INFO ~ Build model....
21:21:20,776 WARN ~ Objective function automatically set to minimize. Set stepFunction in neural net configuration to change default settings.
21:21:20,886 INFO ~ --- Starting network training ---
[Stage 0:> (0 + 6) / 6]
    at org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply$mcV$sp(ParallelCollectionRDD.scala:74)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
21:24:12,383 ERROR ~ Task 5 in stage 0.0 failed 1 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.lang.OutOfMemoryError: Java heap space
    at java.lang.reflect.Array.newInstance(Array.java:70)
    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1670)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
    at org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer.fitDataSet(SparkDl4jMultiLayer.java:239)
    at org.deeplearning4j.examples.cnn.MnistExample.main(MnistExample.java:132)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at java.lang.reflect.Array.newInstance(Array.java:70)
    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1670)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
    at org.nd4j.linalg.api.buffer.BaseDataBuffer.doReadObject(BaseDataBuffer.java:880)
    at org.nd4j.linalg.api.buffer.BaseDataBuffer.readObject(BaseDataBuffer.java:868)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
    at org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply$mcV$sp(ParallelCollectionRDD.scala:74)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
21:24:12,769 ERROR ~ Exception in task 4.0 in stage 0.0 (TID 4)
org.apache.spark.TaskKilledException
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:204)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
21:30:18,649 ERROR ~ Uncaught exception in thread Thread-3
org.apache.spark.SparkException: Error sending message [message = StopBlockManagerMaster]
    at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:118)
    at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
    at org.apache.spark.storage.BlockManagerMaster.tell(BlockManagerMaster.scala:225)
    at org.apache.spark.storage.BlockManagerMaster.stop(BlockManagerMaster.scala:217)
    at org.apache.spark.SparkEnv.stop(SparkEnv.scala:97)
    at org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1756)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1229)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1755)
    at org.apache.spark.SparkContext$$anonfun$3.apply$mcV$sp(SparkContext.scala:596)
    at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:267)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:239)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:239)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:239)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:239)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:239)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:239)
Any ideas?
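One thing I'm unsure about: I never pass any JVM memory options to spark-submit itself, and my understanding is that SPARK_WORKER_MEMORY only sizes standalone-mode workers, not the driver JVM of a local run. If that matters, is something like the following the right way to do it? (The 8g values are just guesses for my machine, not values I've verified.)

```shell
# Untested sketch: size the driver and executor JVMs explicitly via
# spark-submit flags instead of relying on SPARK_WORKER_MEMORY.
spark-submit \
  --driver-memory 8g \
  --executor-memory 8g \
  --class org.deeplearning4j.examples.cnn.MnistExample \
  ./target/dl4j-spark-cdh5-examples-1.0-SNAPSHOT.jar
```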