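I need to calculate date differences in a column using Scala, taking into account the specific ID shown in another column and the first date for that ID.
I have the following dataset:
The ID column holds the specific ID mentioned above, the Date column holds the event date, and the Rank column gives the chronological order of the event dates within each ID.
For ID 1, I need the date difference of ranks 2 and 3 relative to rank 1 of the same ID, and likewise for ID 2, and so on.
The expected result is as follows:
Does anyone know how to do this? Thanks!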
You can get the desired output with the following steps:
// Create the sample data. In spark-shell, spark.implicits._ is already in scope;
// in a standalone application add `import spark.implicits._` for toDF and $"...".
import org.apache.spark.sql.functions._
val sampledf = Seq((1,"2020-12-10",1),(1,"2020-12-12",2),(1,"2020-12-16",3),(2,"2020-12-08",1),(2,"2020-12-11",2),(2,"2020-12-13",3))
  .toDF("ID","Date","Rank").withColumn("Date",$"Date".cast("Date"))
// Add a column that holds the date only on the Rank = 1 row (null on all other rows)
val df1 = sampledf.withColumn("Basedate", when($"Rank" === 1, $"Date"))
// Group by ID and Basedate and keep only the groups whose minimum rank is 1,
// which drops the groups with a null Basedate
val groupedDF = df1.groupBy("ID","Basedate").min("Rank").filter($"min(Rank)" === 1)
// Join the two DataFrames on ID and select the required columns
val joinedDF = df1.join(groupedDF.as("t"), Seq("ID"), "left").select("ID","Date","Rank","t.Basedate")
// Apply the built-in datediff function to get the required output
val finalDF = joinedDF.withColumn("DateDifference", datediff($"Date",$"Basedate"))
finalDF.show(false)
// On Databricks you can use the display method instead:
// display(finalDF)
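The group-and-join steps above can also be expressed with a window function. The following is only a sketch under some assumptions: the object name `WindowDateDiff` and the helper `withDateDifference` are made up for illustration, the session runs in local mode, and each ID is assumed to have a row with Rank 1 that is the lowest rank for that ID.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, datediff, first}

// Hypothetical names for illustration; not part of the answer above.
object WindowDateDiff {

  // Adds Basedate (the Date of the lowest-ranked row per ID) and
  // DateDifference (days between Date and Basedate).
  // Assumption: the lowest rank per ID is Rank 1.
  def withDateDifference(df: DataFrame): DataFrame = {
    val byId = Window.partitionBy("ID").orderBy("Rank")
    df.withColumn("Basedate", first("Date").over(byId))
      .withColumn("DateDifference", datediff(col("Date"), col("Basedate")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run for demonstration
      .appName("WindowDateDiff")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the answer above
    val sampledf = Seq(
      (1, "2020-12-10", 1), (1, "2020-12-12", 2), (1, "2020-12-16", 3),
      (2, "2020-12-08", 1), (2, "2020-12-11", 2), (2, "2020-12-13", 3)
    ).toDF("ID", "Date", "Rank").withColumn("Date", col("Date").cast("date"))

    withDateDifference(sampledf).orderBy("ID", "Rank").show(false)
    spark.stop()
  }
}
```

With the sample data, the differences should come out as 0, 2, 6 for ID 1 and 0, 3, 5 for ID 2. The window version avoids the self-join, at the cost of relying on the rank ordering within each ID.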
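Instead of using a library like Spark to reason about your data in SQL-style terms, this can be done with the Collections API by first finding the minimum date for each ID and then comparing each date in the original collection against it: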
@ import java.time.temporal.ChronoUnit.DAYS
import java.time.temporal.ChronoUnit.DAYS
@ import java.time.LocalDate
import java.time.LocalDate
@ case class Input(id : Int, date : LocalDate, rank : Int)
defined class Input
@ case class Output(id : Int, date : LocalDate, rank : Int, diff : Long)
defined class Output
@ val inData = Seq(Input(1, LocalDate.of(2020, 12, 10), 1),
                   Input(1, LocalDate.of(2020, 12, 12), 2),
                   Input(1, LocalDate.of(2020, 12, 16), 3),
                   Input(2, LocalDate.of(2020, 12, 11), 1),
                   Input(2, LocalDate.of(2020, 12, 13), 2),
                   Input(2, LocalDate.of(2020, 12, 14), 3))
inData: Seq[Input] = List(
  Input(1, 2020-12-10, 1),
  Input(1, 2020-12-12, 2),
  Input(1, 2020-12-16, 3),
  Input(2, 2020-12-11, 1),
  Input(2, 2020-12-13, 2),
  Input(2, 2020-12-14, 3)
)
@ // groupMapReduce requires Scala 2.13 or later
@ val minDates = inData.groupMapReduce(_.id)(identity) { (a, b) =>
    if (a.date.isBefore(b.date)) a else b
  }
minDates: Map[Int, Input] = Map(1 -> Input(1, 2020-12-10, 1), 2 -> Input(2, 2020-12-11, 1))
@ val outData = inData.map(a => Output(a.id, a.date, a.rank, DAYS.between(minDates(a.id).date, a.date)))
outData: Seq[Output] = List(
  Output(1, 2020-12-10, 1, 0L),
  Output(1, 2020-12-12, 2, 2L),
  Output(1, 2020-12-16, 3, 6L),
  Output(2, 2020-12-11, 1, 0L),
  Output(2, 2020-12-13, 2, 2L),
  Output(2, 2020-12-14, 3, 3L)
)
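The session above uses the earliest date per ID as the baseline. If the baseline should instead be the row explicitly marked with rank 1, as the question words it, a variant could look like the sketch below. The object name `RankOneDiff` and the function `diffFromRankOne` are hypothetical, and dropping rows whose ID has no rank-1 entry is an assumption of this sketch rather than something the question specifies.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit.DAYS

final case class Input(id: Int, date: LocalDate, rank: Int)
final case class Output(id: Int, date: LocalDate, rank: Int, diff: Long)

// Hypothetical names for illustration only.
object RankOneDiff {

  // Days between each row's date and the date of the rank-1 row with the same id.
  // Assumption: rows whose id has no rank-1 entry are dropped.
  def diffFromRankOne(rows: Seq[Input]): Seq[Output] = {
    // id -> date of the rank-1 row (the literal 1 in the pattern matches rank == 1)
    val baseDates: Map[Int, LocalDate] =
      rows.collect { case Input(id, date, 1) => id -> date }.toMap

    rows.flatMap { r =>
      baseDates.get(r.id).map { base =>
        Output(r.id, r.date, r.rank, DAYS.between(base, r.date))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val inData = Seq(
      Input(1, LocalDate.of(2020, 12, 10), 1),
      Input(1, LocalDate.of(2020, 12, 12), 2),
      Input(1, LocalDate.of(2020, 12, 16), 3),
      Input(2, LocalDate.of(2020, 12, 11), 1),
      Input(2, LocalDate.of(2020, 12, 13), 2),
      Input(2, LocalDate.of(2020, 12, 14), 3)
    )
    diffFromRankOne(inData).foreach(println)
  }
}
```

For this input the differences are 0, 2, 6 for ID 1 and 0, 2, 3 for ID 2, matching the session above because rank 1 is also the earliest date in each group. This version uses only `collect`, `toMap`, and `flatMap`, so it does not depend on `groupMapReduce`.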