Using window analytic functions, I need to get the max date, excluding the current row's column value.

Account,Instrument,TrDate
1,A,3/1/2018
1,A,3/2/2018
1,B,3/3/2018
1,B,3/6/2018
1,B,3/6/2018
1,B,3/7/2018
2,A,2/7/2018
2,A,2/5/2018
2,B,2/15/2018
2,B,3/6/2018

Expected transformed DataFrame

Account,Instrument,TrDate,MaxInDate,ExcInstrMaxDate
1,A,3/1/2018,3/2/2018,3/7/2018
1,A,3/2/2018,3/2/2018,3/7/2018
1,B,3/3/2018,3/7/2018,3/2/2018
1,B,3/6/2018,3/7/2018,3/2/2018
1,B,3/6/2018,3/7/2018,3/2/2018
1,B,3/7/2018,3/7/2018,3/2/2018
2,A,2/7/2018,2/7/2018,3/6/2018
2,A,2/5/2018,2/7/2018,3/6/2018
2,B,2/15/2018,3/6/2018,2/7/2018
2,B,3/6/2018,3/6/2018,2/7/2018

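Computing ExcInstrMaxDate

Get the max TrDate within the Account window, excluding the current Instrument. For example, for Account 1, Instrument A, ExcInstrMaxDate is the max date of Account 1 with Instrument A filtered out.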
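To make the definition above concrete, here is a minimal plain-Scala sketch (no Spark) that computes both columns for the sample rows. The names `rows`, `latest`, `maxInDate` and `excInstrMaxDate` are illustrative only, and returning `None` for an account with a single instrument is an assumption, since the question does not cover that case.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Sample rows from the question: (Account, Instrument, TrDate)
val rows = Seq(
  (1, "A", "3/1/2018"), (1, "A", "3/2/2018"),
  (1, "B", "3/3/2018"), (1, "B", "3/6/2018"),
  (1, "B", "3/6/2018"), (1, "B", "3/7/2018"),
  (2, "A", "2/7/2018"), (2, "A", "2/5/2018"),
  (2, "B", "2/15/2018"), (2, "B", "3/6/2018")
)

// Parse the M/d/yyyy strings so dates compare chronologically
val fmt = DateTimeFormatter.ofPattern("M/d/yyyy")
val parsed = rows.map { case (acc, ins, d) => (acc, ins, LocalDate.parse(d, fmt)) }

// Helper: latest date in a non-empty collection
def latest(ds: Iterable[LocalDate]): LocalDate = ds.maxBy(_.toEpochDay)

// MaxInDate: latest TrDate per (Account, Instrument)
val maxInDate: Map[(Int, String), LocalDate] =
  parsed.groupBy(r => (r._1, r._2)).map { case (k, rs) => k -> latest(rs.map(_._3)) }

// ExcInstrMaxDate: latest TrDate in the same Account among all other Instruments
// (assumption: None when the account has no other instrument)
val excInstrMaxDate: Map[(Int, String), Option[LocalDate]] =
  maxInDate.keys.map { case key @ (acc, ins) =>
    val others = parsed.collect { case (a, i, d) if a == acc && i != ins => d }
    key -> (if (others.isEmpty) None else Some(latest(others)))
  }.toMap
```

For Account 1, Instrument A this gives 2018-03-07, the latest date among Account 1's other instruments, which matches the expected table.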

1 Answer

You only need two Window specifications, one for MaxInDate and another for ExcInstrMaxDate

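import org.apache.spark.sql.expressions._
// Window over each (Account, Instrument) pair, used for MaxInDate
def windowSpec1 = Window.partitionBy("Account", "Instrument")
// Window over each Account, used for ExcInstrMaxDate
def windowSpec2 = Window.partitionBy("Account")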

You also need a udf function to remove the current MaxInDate from the list of MaxInDate values collected over the Account window

import org.apache.spark.sql.functions._
// Drops the current row's MaxInDate from the MaxInDate values collected for its Account
def removeCurrentMax = udf((currentMax: String, listMax: Seq[String])=> listMax.filterNot(_ == currentMax))

Then use the Window functions together with the udf function as follows

df.withColumn("MaxInDate", max("TrDate").over(windowSpec1))
  .withColumn("ExcInstrMaxDate", removeCurrentMax(col("MaxInDate"), collect_set("MaxInDate").over(windowSpec2)))
  .show(false)

You should get

+-------+----------+---------+---------+---------------+
|Account|Instrument|TrDate   |MaxInDate|ExcInstrMaxDate|
+-------+----------+---------+---------+---------------+
|1      |A         |3/1/2018 |3/2/2018 |[3/7/2018]     |
|1      |A         |3/2/2018 |3/2/2018 |[3/7/2018]     |
|1      |B         |3/3/2018 |3/7/2018 |[3/2/2018]     |
|1      |B         |3/6/2018 |3/7/2018 |[3/2/2018]     |
|1      |B         |3/6/2018 |3/7/2018 |[3/2/2018]     |
|1      |B         |3/7/2018 |3/7/2018 |[3/2/2018]     |
|2      |A         |2/7/2018 |2/7/2018 |[3/6/2018]     |
|2      |A         |2/5/2018 |2/7/2018 |[3/6/2018]     |
|2      |B         |2/15/2018|3/6/2018 |[2/7/2018]     |
|2      |B         |3/6/2018 |3/6/2018 |[2/7/2018]     |
+-------+----------+---------+---------+---------------+

I hope the answer is helpful

Note that I have used TrDate as StringType
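Because TrDate is a StringType here, `max` compares the values as strings rather than as dates, and ExcInstrMaxDate comes out as an array rather than a single value. The following is a sketch of one possible variant, not a drop-in replacement: it parses TrDate with `to_date` and reduces the array to a scalar with the SQL functions `array_max` and `array_remove`. It assumes Spark 2.4 or later (where those two functions exist) and an `M/d/yyyy` date format; the names `byInstrument`, `byAccount` and the temporary column `AccountMaxDates` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_set, expr, max, to_date}

// Local session for the sketch (assumption: run outside spark-shell)
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("exc-instr-max-date")
  .getOrCreate()
import spark.implicits._

// Sample rows from the question, with TrDate still a string
val df = Seq(
  (1, "A", "3/1/2018"), (1, "A", "3/2/2018"),
  (1, "B", "3/3/2018"), (1, "B", "3/6/2018"),
  (1, "B", "3/6/2018"), (1, "B", "3/7/2018"),
  (2, "A", "2/7/2018"), (2, "A", "2/5/2018"),
  (2, "B", "2/15/2018"), (2, "B", "3/6/2018")
).toDF("Account", "Instrument", "TrDate")

val byInstrument = Window.partitionBy("Account", "Instrument")
val byAccount = Window.partitionBy("Account")

val result = df
  // Parse the string into a DateType so max compares chronologically
  .withColumn("TrDate", to_date(col("TrDate"), "M/d/yyyy"))
  .withColumn("MaxInDate", max("TrDate").over(byInstrument))
  // Distinct per-instrument maxima within the account
  .withColumn("AccountMaxDates", collect_set("MaxInDate").over(byAccount))
  // Remove the current instrument's maximum, then keep the latest remaining date
  // (null when nothing remains, e.g. an account with a single instrument)
  .withColumn("ExcInstrMaxDate", expr("array_max(array_remove(AccountMaxDates, MaxInDate))"))
  .drop("AccountMaxDates")

result.show(false)
```

One caveat shared with the udf version: if two instruments in the same account have the same MaxInDate, removing that value removes it for both, so the result would skip a date that another instrument does have.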

answered 2018-03-16T17:54:51.803