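I am currently using an Apache Cassandra 2.1.2 cluster with the Spark 1.2.0 connector. For some initial tests, I need to select a few rows from a Cassandra table using Spark SQL commands in spark-shell.
We use a table named tabletest in the keyspace ks. The table contains, for example, an id (bigint) and a ts (timestamp).
Here is my Spark script: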
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra.CassandraSQLContext
val cc = new CassandraSQLContext(sc)
cc.setKeyspace("ks")
val rdd = cc.sql("SELECT id,ts FROM tabletest LIMIT 100")
// collect() is the non-deprecated equivalent of toArray
rdd.collect().foreach(println)
When I run this script with the command:
spark-shell -i myscript
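everything works fine until a row contains a null value in the ts cell. If any row has a null ts, I get several exceptions, all related to Spark expecting a long value (8 bytes) and getting no bytes at all. I hit the same problem even when I only count the rows without displaying them.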
15/01/29 15:21:35 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
com.datastax.driver.core.exceptions.InvalidTypeException: Invalid 64-bits long value, expecting 8 bytes but got 0
at com.datastax.driver.core.TypeCodec$LongCodec.deserializeNoBoxing(TypeCodec.java:452)
at com.datastax.driver.core.TypeCodec$DateCodec.deserialize(TypeCodec.java:826)
at com.datastax.driver.core.TypeCodec$DateCodec.deserialize(TypeCodec.java:748)
at com.datastax.driver.core.DataType.deserialize(DataType.java:606)
at com.datastax.spark.connector.AbstractGettableData$.get(AbstractGettableData.scala:88)
at org.apache.spark.sql.cassandra.CassandraSQLRow$$anonfun$fromJavaDriverRow$1.apply$mcVI$sp(CassandraSQLRow.scala:42)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.sql.cassandra.CassandraSQLRow$.fromJavaDriverRow(CassandraSQLRow.scala:41)
at org.apache.spark.sql.cassandra.CassandraSQLRow$CassandraSQLRowReader$.read(CassandraSQLRow.scala:49)
at org.apache.spark.sql.cassandra.CassandraSQLRow$CassandraSQLRowReader$.read(CassandraSQLRow.scala:46)
at com.datastax.spark.connector.rdd.CassandraRDD$$anonfun$13.apply(CassandraRDD.scala:378)
at com.datastax.spark.connector.rdd.CassandraRDD$$anonfun$13.apply(CassandraRDD.scala:378)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
at com.datastax.spark.connector.util.CountingIterator.next(CountingIterator.scala:13)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:366)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
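To make the error message concrete: judging from the trace, the timestamp codec tries to read a 64-bit long (8 bytes) and receives an empty value for ts instead. Below is a minimal, self-contained sketch of that check using only java.nio.ByteBuffer. The names NullTimestampSketch, decodeTimestamp and decodeTimestampOption are hypothetical and are not part of the DataStax driver or the connector; the sketch only illustrates the idea of mapping an empty value to None rather than failing.

```scala
import java.nio.ByteBuffer
import java.util.Date

object NullTimestampSketch {
  // Hypothetical strict decoder: mirrors the "expecting 8 bytes" check
  // reported in the stack trace (not the driver's actual code).
  def decodeTimestamp(bytes: ByteBuffer): Date = {
    if (bytes.remaining() != 8)
      throw new IllegalArgumentException(
        s"Invalid 64-bits long value, expecting 8 bytes but got ${bytes.remaining()}")
    // Absolute read, so the buffer position is left unchanged.
    new Date(bytes.getLong(bytes.position()))
  }

  // Hypothetical null-tolerant variant: a null or empty buffer becomes None.
  def decodeTimestampOption(bytes: ByteBuffer): Option[Date] =
    if (bytes == null || bytes.remaining() == 0) None
    else Some(decodeTimestamp(bytes))
}
```

With this sketch, an empty buffer makes decodeTimestamp throw, while decodeTimestampOption returns None for the same input; this is the kind of behavior I would like Spark to apply to the ts column.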
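How can I handle such null values? Do I have to use some function in my SQL query to replace nulls with a default value, or is there a method or parameter I can use in my script so that Spark handles such null values?
Thanks for your help,
Best,
Nicolas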