
I am reading data into a DataFrame in a Databricks notebook:

val data = files
  .grouped(10000)
  .toParArray
  .map(subList => {
    spark.read
      .format("avro")
      .schema(
        StructType(List(
          StructField("Body", BinaryType, nullable = true),
          StructField("Properties", MapType(StringType, StringType, valueContainsNull = true), nullable = true))))
      .load(subList: _*)
  })
  .reduce(_ union _)
  .select("body", "properties")
  .as[Log]
  .filter($"ipAddress" rlike "^([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})$")
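To make the batching and filtering in the snippet above concrete, here is a minimal standalone Scala sketch (no Spark required) of the two pieces of plain-Scala logic it relies on: `grouped(n)` splitting the path list into batches of at most n elements, and the IP regex that the `rlike` filter applies. The file paths are hypothetical placeholders, not from the original post.

```scala
object GroupedSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for the `files` sequence of Avro paths.
    val files = (1 to 25).map(i => s"/mnt/logs/part-$i.avro")

    // grouped(10) yields batches of at most 10 paths each: 10, 10, 5.
    // The original code uses grouped(10000) the same way, then reads
    // each batch with spark.read.load(batch: _*) and unions the results.
    val batches = files.grouped(10).toList
    println(batches.map(_.size).mkString(","))  // prints "10,10,5"

    // The same regex the rlike filter uses, checked here with String.matches
    // (both require the whole string to look like a dotted quad).
    val ipRegex = "^([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})$"
    println("10.139.64.14".matches(ipRegex))   // prints "true"
    println("not-an-ip".matches(ipRegex))      // prints "false"
  }
}
```

Note that `DataFrameReader.load` accepts multiple paths in a single call, so the batching is only needed if the path list is too large to pass at once.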

This gives me a Dataset:

data:org.apache.spark.sql.Dataset[Log]
  timeStamp:timestamp
  Message:string
  HostIP:string
  hostName:string
  userIdentifier:string

I want to perform some operations on it, but as soon as I call count(), show(), or write, I get an error:

data.cache().toDF().count() ->

org.apache.spark.SparkException: Job aborted due to stage failure: Task 24 in stage 157.0 failed 4 times, most recent failure: Lost task 24.3 in stage 157.0 (TID 29598, 10.139.64.14, executor 10): org.apache.spark.repl.RemoteClassLoaderError: line4f12f71540d949f69fe10e4bdb147a3937/$read$$iw.class
at org.apache.spark.repl.ExecutorClassLoader$$anon$1.toClassNotFound(ExecutorClassLoader.scala:150)
............
Caused by: java.nio.channels.ClosedByInterruptException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)

Does anyone know what these errors mean and how to resolve them?

Thanks

