I am trying to understand the difference between coalesce() and repartition().
If I have understood this answer correctly, coalesce() can only reduce the number of partitions of a DataFrame; if we try to increase the number of partitions, the partition count remains unchanged.
But when I tried executing the code below, I observed two things:
- For a DataFrame, the number of partitions can be increased with coalesce.
- For an RDD, if shuffle = false, the number of partitions cannot be increased with coalesce.

Does this mean that the partitions of a coalesced DataFrame can be increased?
Applying coalesce to a DataFrame

When I execute the following code:
val h1b1Df = spark.read.csv("/FileStore/tables/h1b_data.csv")
println("Original dataframe partitions = " + h1b1Df.rdd.getNumPartitions)
val coalescedDf = h1b1Df.coalesce(2)
println("Coalesced dataframe partitions = " + coalescedDf.rdd.getNumPartitions)
val coalescedDf1 = coalescedDf.coalesce(6)
println("Coalesced dataframe with increased partitions = " + coalescedDf1.rdd.getNumPartitions)
I get the following output:
Original dataframe partitions = 8
Coalesced dataframe partitions = 2
Coalesced dataframe with increased partitions = 6
Applying coalesce to an RDD

When I execute the following code:
val inpRdd = h1b1Df.rdd
println("Original rdd partitions = " + inpRdd.getNumPartitions)
val coalescedRdd = inpRdd.coalesce(4)
println("Coalesced rdd partitions = " + coalescedRdd.getNumPartitions)
val coalescedRdd1 = coalescedRdd.coalesce(6, false)
println("Coalesced rdd with increased partitions = " + coalescedRdd1.getNumPartitions)
I get the following output:
Original rdd partitions = 8
Coalesced rdd partitions = 4
Coalesced rdd with increased partitions = 4
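For contrast, the following is a minimal self-contained sketch (using a local SparkSession and a synthetic RDD instead of the h1b_data.csv file, which is an assumption for illustration) showing that passing shuffle = true does let coalesce on an RDD increase the partition count; this is in fact what repartition(n) delegates to internally:

```scala
import org.apache.spark.sql.SparkSession

object CoalesceShuffleDemo {
  def main(args: Array[String]): Unit = {
    // Local session just for this sketch; no input file required.
    val spark = SparkSession.builder()
      .appName("coalesce-shuffle-demo")
      .master("local[8]")
      .getOrCreate()

    // Start from 8 partitions, narrow down to 4 (no shuffle needed).
    val rdd = spark.sparkContext.parallelize(1 to 100, 8)
    val narrowed = rdd.coalesce(4)

    // shuffle = false: cannot grow beyond the current count, stays at 4.
    println(narrowed.coalesce(6, false).getNumPartitions)

    // shuffle = true: a full shuffle is performed, so 6 partitions result.
    // repartition(6) is defined as coalesce(6, shuffle = true).
    println(narrowed.coalesce(6, true).getNumPartitions)

    spark.stop()
  }
}
```

So on the RDD API the distinction is explicit in the shuffle flag, whereas Dataset.coalesce takes no such flag.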