
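I need to compute date differences within a column using Scala. For each specific ID, which is shown in a separate column, every date should be compared with the first date of that ID.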

I have the following dataset:

[image: dataset with the columns ID, Date and Rank]

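The ID column holds the specific ID mentioned above, the Date column holds the event date, and the Rank column gives the chronological order of the event dates within each ID.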

For ID 1, I need the date difference of ranks 2 and 3 relative to rank 1 of the same ID, and likewise for ID 2, and so on.

The expected result is as follows:

[image: expected result]

Does anyone know how to do this? Thanks!


2 Answers


You can get the desired output with the following steps:

// Create the sample data.
// `$"..."` and `toDF` need spark.implicits._; spark-shell and Databricks notebooks already import it.
import spark.implicits._
import org.apache.spark.sql.functions._

val sampledf = Seq((1,"2020-12-10",1),(1,"2020-12-12",2),(1,"2020-12-16",3),(2,"2020-12-08",1),(2,"2020-12-11",2),(2,"2020-12-13",3))
  .toDF("ID","Date","Rank").withColumn("Date",$"Date".cast("Date"))
// Add a column that holds the date only on the Rank 1 row (null on all other rows).
val df1 = sampledf.withColumn("Basedate",when($"Rank" === 1 ,$"Date"))
// Group by ID and basedate, then keep only the groups whose minimum Rank is 1.
// This drops the groups with a null basedate.
val groupedDF = df1.groupBy("ID","basedate").min("Rank").filter($"min(Rank)" === 1)
// Join the two DataFrames on ID and select the required columns.
val joinedDF = df1.join(groupedDF.as("t"), Seq("ID"),"left").select("ID","Date","Rank","t.basedate")
// Apply the built-in datediff function to get the required output.
val finalDF = joinedDF.withColumn("DateDifference", datediff($"Date",$"basedate"))
finalDF.show(false)
// On Databricks you can use the display method instead:
// display(finalDF)
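
The group-and-join steps above could probably also be expressed with a window function. The following is only a sketch and not part of the answer itself: it assumes a local SparkSession (in spark-shell or Databricks `spark` already exists), and the names `events`, `byId`, `withDiff` and `BaseDate` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, datediff, first}

// Assumption: a local session; skip this line where `spark` is already provided.
val spark = SparkSession.builder().master("local[*]").appName("date-diff-sketch").getOrCreate()
import spark.implicits._

// Same sample rows as in the answer above.
val events = Seq(
  (1, "2020-12-10", 1), (1, "2020-12-12", 2), (1, "2020-12-16", 3),
  (2, "2020-12-08", 1), (2, "2020-12-11", 2), (2, "2020-12-13", 3)
).toDF("ID", "Date", "Rank").withColumn("Date", col("Date").cast("date"))

// One window per ID, ordered by Rank, so the first row of each window is the Rank 1 event.
val byId = Window.partitionBy("ID").orderBy("Rank")

val withDiff = events
  .withColumn("BaseDate", first("Date").over(byId))                     // date of the Rank 1 row
  .withColumn("DateDifference", datediff(col("Date"), col("BaseDate"))) // days since that date

withDiff.orderBy("ID", "Rank").show(false)
```

With this sample data the differences should come out as 0, 2, 6 for ID 1 and 0, 3, 5 for ID 2. The window version avoids the intermediate DataFrame and the join, at the cost of relying on Rank 1 always being present for each ID.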
Answered 2021-01-16T10:28:37.477

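As an alternative to using a library like Spark to reason about your data in SQL-style terms, this can be done with the Collections API: first find the minimum date for each ID, then compare each date in the original collection against it: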

@ import java.time.temporal.ChronoUnit.DAYS 
import java.time.temporal.ChronoUnit.DAYS

@ import java.time.LocalDate 
import java.time.LocalDate

@ case class Input(id : Int, date : LocalDate, rank : Int) 
defined class Input

@ case class Output(id : Int, date : LocalDate, rank : Int, diff : Long) 
defined class Output
@ val inData = Seq(Input(1, LocalDate.of(2020, 12, 10), 1),
  Input(1, LocalDate.of(2020, 12, 12), 2),
  Input(1, LocalDate.of(2020, 12, 16), 3),
  Input(2, LocalDate.of(2020, 12, 11), 1),
  Input(2, LocalDate.of(2020, 12, 13), 2),
  Input(2, LocalDate.of(2020, 12, 14), 3)) 
inData: Seq[Input] = List(
  Input(1, 2020-12-10, 1),
  Input(1, 2020-12-12, 2),
  Input(1, 2020-12-16, 3),
  Input(2, 2020-12-11, 1),
  Input(2, 2020-12-13, 2),
  Input(2, 2020-12-14, 3)
)

@ // groupMapReduce requires Scala 2.13 or later; it keeps the row with the earliest date per id
@ val minDates = inData.groupMapReduce(_.id)(identity){(a, b) =>
  if (a.date.isBefore(b.date)) a else b
  }
minDates: Map[Int, Input] = Map(1 -> Input(1, 2020-12-10, 1), 2 -> Input(2, 2020-12-11, 1))
@ val outData = inData.map(a => Output(a.id, a.date, a.rank, DAYS.between(minDates(a.id).date, a.date))) 
outData: Seq[Output] = List(
  Output(1, 2020-12-10, 1, 0L),
  Output(1, 2020-12-12, 2, 2L),
  Output(1, 2020-12-16, 3, 6L),
  Output(2, 2020-12-11, 1, 0L),
  Output(2, 2020-12-13, 2, 2L),
  Output(2, 2020-12-14, 3, 3L)
)
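
The transcript above uses the minimum date per ID as the baseline. Since the question asks for the difference relative to rank 1, a variant that looks up the rank 1 row directly might look like the sketch below. It is an illustration only: the names `Event`, `EventWithDiff`, `baseDates` and `withDiffs` are hypothetical, and dropping IDs without a rank 1 row is an assumption, not something the question specifies.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit.DAYS

// Hypothetical row types, mirroring Input and Output from the transcript above.
case class Event(id: Int, date: LocalDate, rank: Int)
case class EventWithDiff(id: Int, date: LocalDate, rank: Int, diff: Long)

val events = Seq(
  Event(1, LocalDate.of(2020, 12, 10), 1),
  Event(1, LocalDate.of(2020, 12, 12), 2),
  Event(1, LocalDate.of(2020, 12, 16), 3),
  Event(2, LocalDate.of(2020, 12, 11), 1),
  Event(2, LocalDate.of(2020, 12, 13), 2),
  Event(2, LocalDate.of(2020, 12, 14), 3)
)

// Base date per ID, taken from the rank 1 row rather than from the minimum date.
val baseDates: Map[Int, LocalDate] =
  events.collect { case Event(id, date, 1) => id -> date }.toMap

// Assumption: rows whose ID has no rank 1 row are dropped instead of failing.
val withDiffs: Seq[EventWithDiff] = events.flatMap { e =>
  baseDates.get(e.id).map { base =>
    EventWithDiff(e.id, e.date, e.rank, DAYS.between(base, e.date))
  }
}

withDiffs.foreach(println)
```

When the ranks follow the dates chronologically, as in the question, both approaches give the same differences; they only diverge if the rank 1 row is not the earliest date for its ID.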
Answered 2021-01-15T16:23:14.767