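H2O Sparkling Water frequently throws the exception shown below, and every time it happens we rerun the job manually. The problem is that the Spark job does not exit when this exception occurs and does not return an exit status, so we cannot automate the rerun.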

App > Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 316 in stage 22.0 failed 4 times, most recent failure: Lost task 316.3 in stage 22.0 (TID 9470, ip-**-***-***-**.ec2.internal): java.lang.ArrayIndexOutOfBoundsException: 65535
App > at water.DKV.get(DKV.java:202)
App > at water.DKV.get(DKV.java:175)
App > at water.Key.get(Key.java:83)
App > at water.fvec.Frame.createNewChunks(Frame.java:896)
App > at water.fvec.FrameUtils$class.createNewChunks(FrameUtils.scala:43)
App > at water.fvec.FrameUtils$.createNewChunks(FrameUtils.scala:70)
App > at org.apache.spark.h2o.backends.internal.InternalWriteConverterContext.createChunks(InternalWriteConverterContext.scala:28)
App > at org.apache.spark.h2o.converters.SparkDataFrameConverter$class.org$apache$spark$h2o$converters$SparkDataFrameConverter$$perSQLPartition(SparkDataFrameConverter.scala:86)
App > at org.apache.spark.h2o.converters.SparkDataFrameConverter$$anonfun$toH2OFrame$1$$anonfun$apply$2.apply(SparkDataFrameConverter.scala:67)
App > at org.apache.spark.h2o.converters.SparkDataFrameConverter$$anonfun$toH2OFrame$1$$anonfun$apply$2.apply(SparkDataFrameConverter.scala:67)
App > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
App > at org.apache.spark.scheduler.Task.run(Task.scala:85)
App > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
App > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
App > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
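
One possible stopgap for the automation problem, sketched here under stated assumptions rather than as a confirmed fix: run the submit command under a watchdog that enforces a timeout, scans the captured output for a failure marker, and turns either condition into a non-zero status that a scheduler can act on. The function names, the marker string, the timeout, and the status codes are all hypothetical choices; the marker is simply the text that appears in the log above.

```python
import subprocess

# Assumption: this text appears in the driver output whenever the job has failed,
# as in the log above. Adjust it to whatever your own logs contain.
FAILURE_MARKER = "Exception in thread"


def run_with_watchdog(cmd, timeout_s, marker=FAILURE_MARKER):
    """Run cmd once and map the outcome to an exit status.

    Returns 0 on success, 1 if the process exits non-zero or prints the
    failure marker, and 124 if it is still running after timeout_s seconds.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so one scan covers both streams
            text=True,
            timeout=timeout_s,  # on expiry, run() kills the child and raises
        )
    except subprocess.TimeoutExpired:
        return 124
    if proc.returncode != 0 or marker in (proc.stdout or ""):
        return 1
    return 0


def run_with_retries(cmd, timeout_s, max_attempts=3):
    """Rerun cmd until it succeeds or max_attempts is reached; return the last status."""
    status = 1
    for _ in range(max_attempts):
        status = run_with_watchdog(cmd, timeout_s)
        if status == 0:
            break
    return status
```

A call such as `run_with_retries(["spark-submit", ...], timeout_s=7200)` would then replace the manual rerun, with the remaining arguments being whatever the job already uses. Note that the timeout only kills the process started directly by the wrapper; if the application runs on the cluster independently of that process, it may need to be stopped separately.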

1 Answer

This problem is being investigated in the following issue in the Sparkling Water project:

It appears to be related to the size of the data.

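It happens when we try to pull a huge Spark DataFrame into an H2O frame: 63M records x 6,300 columns. The H2O/Sparkling Water cluster should be adequately sized, though: 40 executors x 17 GB of memory each, with 4 threads/cores per Spark executor, for a total of 680 GB of memory.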
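
To put those numbers side by side, here is a rough back-of-the-envelope calculation. The 8 bytes per cell is purely an assumption (an uncompressed double); H2O can store columns in compressed form, so the real footprint may be considerably smaller, and this is not a statement about what the cluster actually used.

```python
# Figures taken from the answer above.
rows = 63_000_000
cols = 6_300
executors = 40
mem_per_executor_gb = 17

# Assumption: every cell is stored as an uncompressed 8-byte double.
bytes_per_cell = 8

cells = rows * cols
raw_gb = cells * bytes_per_cell / 1e9  # decimal gigabytes
total_mem_gb = executors * mem_per_executor_gb

print(f"cells:            {cells:,}")
print(f"raw size (GB):    {raw_gb:,.1f}")
print(f"cluster mem (GB): {total_mem_gb}")
print(f"raw / cluster:    {raw_gb / total_mem_gb:.2f}")
```

Under that assumption the frame has about 397 billion cells and an uncompressed size of roughly 3.2 TB, several times the 680 GB of executor memory, so how well the data compresses matters a great deal at this scale.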

We have never hit this error on smaller datasets.

Answered 2018-03-06T06:56:23.713