0

我创建字典累加器时发生了我的问题。我正在尝试使用数据框中的数据填充嵌套字典。字典是 2 层深,第二层有一个集合作为值,如下所示:

{
    key1: {
         key2: set() 
    }
}

我在 AWS sagemaker notebook 上运行我的代码,我需要它在那个环境中运行。我正在使用 PySpark 内核。我的累加器定义如下:

class DictAccumulator(AccumulatorParam):
    def zero(self, initialValue: dict):
        return initialValue

    def addInPlace(self, v1: dict, v2: dict):
        for key1, innerDict in v2.items():
            for key2, value in innerDict.items():
                v1[key1][key2].update(v1[key1][key2].union(value))
        return v1
dictAccum = spark.sparkContext.accumulator(defaultdict(lambda: defaultdict(set)), DictAccumulator())

上面的代码运行良好,我知道这是因为我在 sagemaker 笔记本中单独运行它。下一个片段会导致程序崩溃:

dataset.foreach(lambda x: dictAccum.add({ 
x[key1] : {
    x[key2] : x[value]
}}))

这是错误日志

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 20 in stage 65.0 failed 4 times, most recent failure: Lost task 20.3 in stage 65.0 (TID 823, ip-172-31-22-143.us-west-2.compute.internal, executor 7): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$3.applyOrElse(PythonRunner.scala:486)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$3.applyOrElse(PythonRunner.scala:475)
  at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:593)
  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:586)
  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
  at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
  at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
  at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:121)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
  at java.io.DataInputStream.readInt(DataInputStream.java:392)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$handleEndOfDataSection$1.apply$mcVI$sp(PythonRunner.scala:461)
  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handleEndOfDataSection(PythonRunner.scala:460)
  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:590)
  ... 27 more

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
  at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
  at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  at py4j.Gateway.invoke(Gateway.java:282)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.GatewayConnection.run(GatewayConnection.java:238)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$3.applyOrElse(PythonRunner.scala:486)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$3.applyOrElse(PythonRunner.scala:475)
  at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:593)
  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:586)
  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
  at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
  at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
  at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:121)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  ... 1 more
Caused by: java.io.EOFException
  at java.io.DataInputStream.readInt(DataInputStream.java:392)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$handleEndOfDataSection$1.apply$mcVI$sp(PythonRunner.scala:461)
  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handleEndOfDataSection(PythonRunner.scala:460)
  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:590)
  ... 27 more

Traceback (most recent call last):
File "/mnt/yarn/usercache/livy/appcache/application_1625078952133_0005/container_1625078952133_0005_01_000001/pyspark.zip/pyspark/sql/dataframe.py", line 583, in foreach
  self.rdd.foreach(f)
File "/mnt/yarn/usercache/livy/appcache/application_1625078952133_0005/container_1625078952133_0005_01_000001/pyspark.zip/pyspark/rdd.py", line 789, in foreach
  self.mapPartitions(processPartition).count()  # Force evaluation
File "/mnt/yarn/usercache/livy/appcache/application_1625078952133_0005/container_1625078952133_0005_01_000001/pyspark.zip/pyspark/rdd.py", line 1055, in count
  return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/mnt/yarn/usercache/livy/appcache/application_1625078952133_0005/container_1625078952133_0005_01_000001/pyspark.zip/pyspark/rdd.py", line 1046, in sum
  return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
File "/mnt/yarn/usercache/livy/appcache/application_1625078952133_0005/container_1625078952133_0005_01_000001/pyspark.zip/pyspark/rdd.py", line 917, in fold
  vals = self.mapPartitions(func).collect()
File "/mnt/yarn/usercache/livy/appcache/application_1625078952133_0005/container_1625078952133_0005_01_000001/pyspark.zip/pyspark/rdd.py", line 816, in collect
  sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/mnt/yarn/usercache/livy/appcache/application_1625078952133_0005/container_1625078952133_0005_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  answer, self.gateway_client, self.target_id, self.name)
File "/mnt/yarn/usercache/livy/appcache/application_1625078952133_0005/container_1625078952133_0005_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  return f(*a, **kw)
File "/mnt/yarn/usercache/livy/appcache/application_1625078952133_0005/container_1625078952133_0005_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
  format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 20 in stage 65.0 failed 4 times, most recent failure: Lost task 20.3 in stage 65.0 (TID 823, ip-172-31-22-143.us-west-2.compute.internal, executor 7): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$3.applyOrElse(PythonRunner.scala:486)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$3.applyOrElse(PythonRunner.scala:475)
  at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:593)
  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:586)
  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
  at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
  at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
  at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:121)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
  at java.io.DataInputStream.readInt(DataInputStream.java:392)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$handleEndOfDataSection$1.apply$mcVI$sp(PythonRunner.scala:461)
  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handleEndOfDataSection(PythonRunner.scala:460)
  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:590)
  ... 27 more

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
  at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
  at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  at py4j.Gateway.invoke(Gateway.java:282)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.GatewayConnection.run(GatewayConnection.java:238)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$3.applyOrElse(PythonRunner.scala:486)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$3.applyOrElse(PythonRunner.scala:475)
  at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:593)
  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:586)
  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
  at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
  at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
  at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:121)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  ... 1 more
Caused by: java.io.EOFException
  at java.io.DataInputStream.readInt(DataInputStream.java:392)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$handleEndOfDataSection$1.apply$mcVI$sp(PythonRunner.scala:461)
  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handleEndOfDataSection(PythonRunner.scala:460)
  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:590)
  ... 27 more```
4

0 回答 0