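I have an RDD (data) and a list of column indices over which to compute entropy. With the pipeline below, computing a single entropy value on a 2 MB (16k rows) source takes about 5 seconds.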
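import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._  // pair-RDD implicits (reduceByKey) on older Spark versions

def entropy(data: RDD[Array[String]], colIdx: Array[Int], count: Long): Double = {
  println(data.toDebugString)
  // Key each row by the comma-joined values of the selected columns
  data.map(r => colIdx.map(idx => r(idx)).mkString(",") -> 1)
    .reduceByKey(_ + _)  // occurrences per distinct key
    .map(v => {
      val p = v._2.toDouble / count
      -p * scala.math.log(p) / scala.math.log(2)
    })
    .reduce((v1, v2) => v1 + v2)
}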
The debug string output is as follows:
(entropy,MappedRDD[93] at map at Q.scala:31 (8 partitions)
UnionRDD[72] at $plus$plus at S.scala:136 (8 partitions)
MappedRDD[60] at map at S.scala:151 (4 partitions)
FilteredRDD[59] at filter at S.scala:150 (4 partitions)
MappedRDD[40] at map at S.scala:124 (4 partitions)
MapPartitionsRDD[39] at mapPartitionsWithIndex at L.scala:356 (4 partitions)
FilteredRDD[27] at filter at S.scala:104 (4 partitions)
MappedRDD[8] at map at X.scala:21 (4 partitions)
MappedRDD[6] at map at R.scala:39 (4 partitions)
FlatMappedRDD[5] at objectFile at F.scala:51 (4 partitions)
HadoopRDD[4] at objectFile at F.scala:51 (4 partitions)
MappedRDD[68] at map at S.scala:151 (4 partitions)
FilteredRDD[67] at filter at S.scala:150 (4 partitions)
MappedRDD[52] at map at S.scala:124 (4 partitions)
MapPartitionsRDD[51] at mapPartitionsWithIndex at L.scala:356 (4 partitions)
FilteredRDD[28] at filter at S.scala:105 (4 partitions)
MappedRDD[8] at map at X.scala:21 (4 partitions)
MappedRDD[6] at map at R.scala:39 (4 partitions)
FlatMappedRDD[5] at objectFile at F.scala:51 (4 partitions)
HadoopRDD[4] at objectFile at F.scala:51 (4 partitions),colIdex,13,count,3922)
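If I collect the RDD and parallelize it again, the computation takes about 150 ms (which still seems high for a simple 2 MB file), and this will obviously become a problem with multi-GB data. What am I missing about using Spark and Scala correctly?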
My original implementation (which performed even worse):
data.map(r => colIdx.map(idx => r(idx)).mkString(","))
  .groupBy(r => r)
  .map(g => g._2.size)
  .map(v => v.toDouble / count)
  .map(v => -v * scala.math.log(v) / scala.math.log(2))
  .reduce((v1, v2) => v1 + v2)
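
For reference, here is a minimal non-Spark sketch of the quantity both versions above are meant to compute: the Shannon entropy, in bits, of the comma-joined values of the selected columns. It runs on plain Scala collections, so it says nothing about the Spark timings. The object name `EntropySketch` and the sample rows are made up for illustration, and unlike the functions above it derives the count from the rows instead of taking it as a parameter.

```scala
object EntropySketch {
  // Shannon entropy (in bits) of the joint values found in the selected columns.
  // Assumption: the total count is simply the number of rows passed in.
  def entropy(rows: Seq[Array[String]], colIdx: Array[Int]): Double = {
    val total = rows.size.toDouble
    rows
      .map(r => colIdx.map(idx => r(idx)).mkString(",")) // same composite key as the RDD versions
      .groupBy(identity)                                  // key -> all occurrences of that key
      .values
      .map { group =>
        val p = group.size / total                        // relative frequency of this key
        -p * math.log(p) / math.log(2)
      }
      .sum                                                // 0.0 when there are no rows
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data: two columns, four rows
    val rows = Seq(Array("a", "x"), Array("a", "y"), Array("b", "x"), Array("b", "y"))
    println(entropy(rows, Array(0)))    // two equally likely keys: about 1 bit
    println(entropy(rows, Array(0, 1))) // four equally likely keys: about 2 bits
  }
}
```

Checking the Spark result against something like this on a small collected sample would at least confirm that the slow and fast variants agree on the value.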