
I need to perform the following join operation in Spark:

JavaPairRDD<String, Tuple2<Optional<MarkToMarketPNL>, Optional<MarkToMarketPNL>>> finalMTMPNLRDD = openMTMPNL.fullOuterJoin(closedMTMPNL);

To perform this operation I need two JavaPairRDDs, closedMTMPNL and openMTMPNL. The openMTM and closedMTM RDDs themselves work fine, but the keyBy on both RDDs fails at runtime:

JavaPairRDD<String, MarkToMarketPNL> openMTMPNL = openMTM.keyBy(
        new Function<MarkToMarketPNL, String>() {
            @Override
            public String call(MarkToMarketPNL mtm) throws Exception {
                return mtm.getTaxlot();
            }
        });

JavaPairRDD<String, MarkToMarketPNL> closedMTMPNL = closedMTM.keyBy(
        new Function<MarkToMarketPNL, String>() {
            @Override
            public String call(MarkToMarketPNL mtm) throws Exception {
                return mtm.getTaxlot();
            }
        });

Is there another way to join the openMTM and closedMTM RDDs? For now I am just trying to get two RDDs that can be joined on a String key. What is causing the exception?

Stack trace attached:

java.lang.NullPointerException
15/06/28 01:19:30 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NullPointerException
    at scala.collection.convert.Wrappers$JIterableWrapper.iterator(Wrappers.scala:53)
    at scala.collection.IterableLike$class.toIterator(IterableLike.scala:89)
    at scala.collection.AbstractIterable.toIterator(Iterable.scala:54)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1626)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1095)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1095)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
15/06/28 01:19:30 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
    at scala.collection.convert.Wrappers$JIterableWrapper.iterator(Wrappers.scala:53)
    at scala.collection.IterableLike$class.toIterator(IterableLike.scala:89)
    at scala.collection.AbstractIterable.toIterator(Iterable.scala:54)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1626)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1095)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1095)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

3 Answers


This exception occurs because one of your functions is returning null. You can let the key function return null and then filter out the entries with a null key, for example:

JavaPairRDD<String, MarkToMarketPNL> openMTMPNL = openMTM
        .keyBy(new Function<MarkToMarketPNL, String>() {
            @Override
            public String call(MarkToMarketPNL mtm) throws Exception {
                return mtm.getTaxlot(); // may return null
            }
        })
        .filter(new Function<Tuple2<String, MarkToMarketPNL>, Boolean>() {
            @Override
            public Boolean call(Tuple2<String, MarkToMarketPNL> arg) throws Exception {
                // Keep only entries with a non-null key.
                return arg != null && arg._1() != null;
            }
        });
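Applying the same filter to closedMTMPNL leaves two RDDs with no null keys, after which the full outer join from the question can run. A minimal sketch, assuming the same MarkToMarketPNL class and source RDDs as above:

JavaPairRDD<String, MarkToMarketPNL> closedMTMPNL = closedMTM
        .keyBy(new Function<MarkToMarketPNL, String>() {
            @Override
            public String call(MarkToMarketPNL mtm) throws Exception {
                return mtm.getTaxlot();
            }
        })
        .filter(new Function<Tuple2<String, MarkToMarketPNL>, Boolean>() {
            @Override
            public Boolean call(Tuple2<String, MarkToMarketPNL> arg) throws Exception {
                // Same guard as above: drop entries with a null key.
                return arg != null && arg._1() != null;
            }
        });

// Neither side now contains null keys, so the join is safe to run.
JavaPairRDD<String, Tuple2<Optional<MarkToMarketPNL>, Optional<MarkToMarketPNL>>> finalMTMPNLRDD =
        openMTMPNL.fullOuterJoin(closedMTMPNL);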
answered 2015-08-19T13:35:12.980

I don't think the error is in the code you included in the question. Spark is trying to run count on an RDD. The code you included never calls count, so that is one clue. The exception also shows that the RDD being counted contains an Iterable that was created in Java and is now being converted to a Scala iterator; at that point this Iterable turns out to be null.

Does your code produce an Iterable somewhere, perhaps in a mapPartitions call or something similar?
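For illustration only, here is a hypothetical way such a null Iterable can arise with the Spark 1.x Java API, where a FlatMapFunction returns an Iterable that Spark later wraps into a Scala iterator (the lines RDD and the splitting logic are invented, not taken from your code):

// lines is some existing JavaRDD<String>.
JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterable<String> call(String line) throws Exception {
        if (line.isEmpty()) {
            // BUG: returning null instead of an empty collection produces
            // the NullPointerException in JIterableWrapper.iterator above.
            return null;
        }
        return Arrays.asList(line.split(" "));
    }
});
words.count(); // the count action then fails on the executor

The fix in such a case is to return an empty collection, e.g. Collections.<String>emptyList(), instead of null.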

answered 2015-06-27T20:26:54.223

I have run into the same problem. When the join operation is executed, it internally creates <key, Iterable<values>> pairs. If one of the Iterable<values> objects is null, we see the NullPointerException above.

Make sure none of the values are null before performing the join.
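A minimal sketch of that pre-join guard, assuming the openMTMPNL and closedMTMPNL pair RDDs built in the question:

// Drop pairs carrying a null value so the join's internal cogroup never
// sees a null element on either side.
Function<Tuple2<String, MarkToMarketPNL>, Boolean> nonNullValue =
        new Function<Tuple2<String, MarkToMarketPNL>, Boolean>() {
            @Override
            public Boolean call(Tuple2<String, MarkToMarketPNL> t) throws Exception {
                return t._2() != null;
            }
        };

JavaPairRDD<String, Tuple2<Optional<MarkToMarketPNL>, Optional<MarkToMarketPNL>>> joined =
        openMTMPNL.filter(nonNullValue).fullOuterJoin(closedMTMPNL.filter(nonNullValue));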

answered 2017-01-27T02:01:56.690