我使用 Spark 笛卡尔函数来生成列表 N 对值。
然后我映射这些值以生成每个用户之间的距离度量:
val cartesianUsers: org.apache.spark.rdd.RDD[(distance.classes.User, distance.classes.User)] = users.cartesian(users)
cartesianUsers.map(m => manDistance(m._1, m._2))
这按预期工作。
使用 Spark Streaming 库,我创建了一个 DStream,然后对其进行映射:
val customReceiverStream: ReceiverInputDStream[String] = ssc.receiverStream....
customReceiverStream.foreachRDD(m => {
println("size is " + m)
})
我可以在 customReceiverStream.foreachRDD 中使用笛卡尔函数,但根据文档http://spark.apache.org/docs/1.2.0/streaming-programming-guide.htm这不是它的预期用途:
foreachRDD(func) 应用函数的最通用的输出运算符,func, to each RDD generated from the stream. This function should push the data in each RDD to a external system, like saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
如何计算 DStream 的笛卡尔?也许我误解了 DStreams 的使用?