
I am trying to read a Kudu table with Apache Spark from a Jupyter Notebook running on the Apache Toree - Scala kernel.

Spark version: 2.2.0, Scala version: 2.11, Apache Toree version: 0.3

This is the code I use to read the Kudu table:

import org.apache.kudu.spark.kudu._

val kuduMasterAddresses = KUDU_MASTER_ADDRESSES_HERE
val kuduMasters: String = Seq(kuduMasterAddresses).mkString(",")

val kuduContext = new KuduContext(kuduMasters, spark.sparkContext)

val table = TABLE_NAME_HERE

def readKudu(table: String) = {
    val tableKuduOptions: Map[String, String] = Map(
        "kudu.table"  -> table,
        "kudu.master" -> kuduMasters
    )
    spark.sqlContext.read.options(tableKuduOptions).kudu
}

val kuduTableDF = readKudu(table)
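
The helper above only builds an options map and hands it to the DataFrameReader. As a minimal, cluster-free sketch of that construction (the names here mirror the notebook code but are otherwise hypothetical):

```scala
// Build the options map expected by the kudu-spark connector.
// "kudu.master" takes a comma-separated list of master addresses.
def kuduReadOptions(table: String, masters: Seq[String]): Map[String, String] =
  Map(
    "kudu.table"  -> table,
    "kudu.master" -> masters.mkString(",")
  )
```

These options would then be passed as `spark.read.options(kuduReadOptions(table, masters)).format("org.apache.kudu.spark.kudu").load()`, which is equivalent to the `.kudu` shorthand provided by the connector's implicits.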

Calling kuduContext.tableExists(table) returns true, and kuduTableDF.columns returns a string array with the correct column names.

The problem appears when I try to apply actions such as count or show. The following exception is thrown:

Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: java.lang.ClassNotFoundException: org.apache.kudu.spark.kudu.KuduContext$TimestampAccumulator
StackTrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1567)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1555)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1554)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1554)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:803)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1782)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1737)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1726)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:619)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2031)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2052)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2071)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2865)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154)
  at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2846)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2845)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2154)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2367)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:600)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:609)

I have already used the AddDeps magic in the Apache Toree notebook, as follows:

%AddDeps org.apache.kudu kudu-spark2_2.11 1.6.0 --transitive --trace
%AddDeps org.apache.kudu kudu-client 1.6.0 --transitive --trace
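
My suspicion is that %AddDeps only resolves the jars onto the Toree kernel (driver) classpath, while the ClassNotFoundException above is raised when task results are deserialized, i.e. the executors cannot see the kudu-spark jar. As a sketch of one way to ship the connector to the executors as well (assuming spark-defaults.conf, or the equivalent --packages flag in SPARK_OPTS for the Toree kernel):

```
# spark-defaults.conf: make the connector resolvable on both driver and executors
spark.jars.packages    org.apache.kudu:kudu-spark2_2.11:1.6.0
```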

The following import executes without any problem:

import org.apache.kudu.spark.kudu._
