
I have the following data frame:

+-------+-------+-----+----------+-------------+--------------------+------------+-------+
|     id|groupid|field| oldstring|    newstring|             created|        pkey|project|
+-------+-------+-----+----------+-------------+--------------------+------------+-------+
|1451923| 594128|Team1|  [RLA N1]|   [N1-UO-SB]|2013-03-29 13:31:...|DSTECH-55770|  10120|
|1451958| 594140|Team1|  [SEP N2]|     [SEP N2]|2013-03-29 13:34:...|DSTECH-56998|  10120|
|1452282| 594308|Team1|[N1-UO-SE]|     [SEP N2]|2013-03-29 14:09:...|DSTECH-57900|  10120|
|1492252| 610736|Team1|[N1-UO-SE]|     [SEP N2]|2013-04-17 08:48:...|DSTECH-59560|  10120|
|5105082|2304145|Team1|   [Aucun]|[SEP-SUPPORT]|2017-09-01 09:46:...|    ECO-9781|  10280|
|5105084|2304145|Team2|      null|  SEP-SUPPORT|2017-09-01 09:46:...|    ECO-9781|  10280|
|5105084|2304145|Team1|  [ISR N2]|  SEP-SUPPORT|2013-03-29 13:31:...|DSTECH-57895|  10120|
|1451926| 594129|Team1|[N1-UO-SE]|     [ISR N2]|2013-03-29 13:55:...|DSTECH-57895|  10120|
|1452182| 594273|Team1|[N1-UO-SE]|  [SEPN1-ENV]|2013-03-29 13:43:...|DSTECH-57895|  10120|
+-------+-------+-----+----------+-------------+--------------------+------------+-------+

I want to compute the treatment date/time for each [pkey]. For example, take these two rows:

+-------+-------+-----+----------+-------------+--------------------+------------+
|     id|groupid|field| oldstring|    newstring|             created|        pkey|
+-------+-------+-----+----------+-------------+--------------------+------------+
|1451923| 594128|Team1|  [RLA N1]|   [N1-UO-SB]|2013-03-29 13:31:...|DSTECH-55770|
|1451958| 594140|Team1|  [SEP N2]|     [SEP N2]|2013-03-29 13:34:...|DSTECH-56998|
+-------+-------+-----+----------+-------------+--------------------+------------+

treatment date/time[DSTECH-55770] = [2013-03-29 13:34:...] - [2013-03-29 13:31:...]

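How can I compute the difference from the previous date? I found that a user-defined aggregate function (UDAF) could do this. However, is that solution suitable for showing the difference between two dates as a duration (for example 8h:30min)? By 8h I do not mean 8 o'clock; I mean that the number of hours is 8.

If anyone can help: how would I do this with a UDAF, or do you have another solution? Thanks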


1 Answer


This may be a case for SQL window functions. You can find more details here.

I suspect the resulting code could look like

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val sparkSession = ...  // Create as usual
import sparkSession.implicits._
// `df` is the data frame shown in the question
// For the same project, order rows by the `created` column
val partitionWindow = Window.partitionBy("project").orderBy($"created".asc)
// Value of the `created` column in the next row of the same project;
// fall back to the current timestamp when there is no next row
val createdTimeNextRowSameProject = coalesce(
  lead($"created", 1).over(partitionWindow),  // 1 = next row, 2 = two rows after, and so on
  current_timestamp()
)
// New column `datediff`: seconds between this row and the next one
val dfWithTimeDiffInSeconds = df.withColumn("datediff", unix_timestamp(createdTimeNextRowSameProject) - unix_timestamp($"created"))
dfWithTimeDiffInSeconds.show(10)
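
The `datediff` column above holds a number of seconds, while the question asks for something like 8h:30min. A minimal sketch of that conversion in plain Scala follows; `formatDuration` is a hypothetical helper name, not part of Spark, and the exact output format is an assumption based on the example in the question.

```scala
// Hypothetical helper: turn a number of seconds into an "Xh:YYmin" string,
// where X is a count of hours (it may exceed 24), not a time of day.
def formatDuration(totalSeconds: Long): String = {
  val sign = if (totalSeconds < 0) "-" else ""
  val abs = math.abs(totalSeconds)
  val hours = abs / 3600          // whole hours
  val minutes = (abs % 3600) / 60 // remaining whole minutes
  f"$sign${hours}h:$minutes%02dmin"
}

println(formatDuration(30600L)) // prints 8h:30min
```

To apply it to the data frame, one option would probably be to wrap it with `udf` from `org.apache.spark.sql.functions` and call the result on the `datediff` column.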
answered 2018-06-11T16:38:40.447