I have a single-column DataFrame like this:

------------
   date
------------
01/01/2020
02/01/2020
04/01/2020
05/01/2020
06/01/2020

I need to find the start date and end date of the longest run of consecutive dates. For the example above, the expected output is:

------------------------------------------
start      | end        | period_length
------------------------------------------
04/01/2020 | 06/01/2020 | 3

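My approach: sort the data and compute the lag to the previous row; whenever the lag is greater than 1 day, reset the period length. However, I can't find a way to reset the period when that condition is met. I am using Spark 2.3.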
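
The sort-and-reset idea described above can be sketched in plain Scala collections (not Spark) to make the reset step concrete. This is only an illustration under assumptions: the sample dates are read as dd/MM/yyyy, and the names `LagReset` and `longestRun` are hypothetical. In Spark itself the same idea is usually expressed with `lag` and a running `sum` over a window ordered by date.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

object LagReset {
  // Assumption: the sample dates are day/month/year.
  private val fmt = DateTimeFormatter.ofPattern("dd/MM/yyyy")

  /** Returns (start, end, length) of the longest run of consecutive days, if any. */
  def longestRun(raw: Seq[String]): Option[(String, String, Int)] = {
    // De-duplicate and sort, as the approach in the question requires.
    val dates = raw.map(s => LocalDate.parse(s.trim, fmt)).distinct.sortBy(_.toEpochDay)

    // Walk the sorted dates. If the gap to the previous date is exactly one day,
    // extend the current run; otherwise "reset" by starting a new run.
    val runs = dates.foldLeft(Vector.empty[Vector[LocalDate]]) { (acc, d) =>
      acc.lastOption match {
        case Some(run) if ChronoUnit.DAYS.between(run.last, d) == 1 =>
          acc.init :+ (run :+ d)
        case _ =>
          acc :+ Vector(d)
      }
    }

    if (runs.isEmpty) None
    else {
      val best = runs.maxBy(_.size)
      Some((best.head.format(fmt), best.last.format(fmt), best.size))
    }
  }
}
```

Calling `LagReset.longestRun` on the five sample dates yields `Some(("04/01/2020", "06/01/2020", 3))`, which matches the expected output in the question.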

1 Answer

Note: my column is named "eventTime" and holds values such as "2020-12-14 13:49:32".

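  // `sc` must be a SparkSession (or SQLContext) here; SparkContext has no `sql` method.
  // Idea: once the distinct dates are ordered and numbered, date minus row number
  // is constant within each run of consecutive days, so it works as a group key.
  // `having counts > 2` keeps every run longer than two days; to return only the
  // longest run, use `order by counts desc limit 1` instead.
  sc.sql(
  """
    |
    |   select
    |     min(eventTime), max(eventTime) ,  count(1)  as counts
    |   from
    |   (
    |       select
    |           eventTime , date_sub(eventTime , rn) as dis
    |       from
    |       (
    |           select
    |               eventTime , row_number() over(partition by 1 order by eventTime) rn
    |           from (select distinct substring(eventTime,0,10) as eventTime from ST_INOUT_RECORD)
    |       ) t1
    |   ) t2
    |   group by dis  having counts > 2
    |
    |""".stripMargin).show()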

Result

+--------------+--------------+------+
|min(eventTime)|max(eventTime)|counts|
+--------------+--------------+------+
|    2020-09-12|    2020-12-14|    94|
|    2020-01-01|    2020-09-10|   254|
+--------------+--------------+------+
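
To see why grouping by `date_sub(eventTime, rn)` isolates consecutive runs, here is a small plain-Scala sketch (no Spark) that applies the same trick to the five sample dates from the question. It is an illustration only: the dd/MM/yyyy format is an assumption, and `DateMinusRowNumber` and `runs` are hypothetical names.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object DateMinusRowNumber {
  // Assumption: the sample dates are day/month/year.
  private val fmt = DateTimeFormatter.ofPattern("dd/MM/yyyy")

  /** Returns every run of consecutive days as (start, end, length), longest first. */
  def runs(raw: Seq[String]): Seq[(String, String, Int)] = {
    // Distinct, sorted day numbers (days since 1970-01-01).
    val days = raw.map(s => LocalDate.parse(s.trim, fmt).toEpochDay).distinct.sorted

    days.zipWithIndex
      // Row numbers start at 1 in SQL, hence idx + 1. Within a run of consecutive
      // days, both the day and the row number grow by 1, so the difference is constant.
      .groupBy { case (day, idx) => day - (idx + 1) }
      .values
      .map { group =>
        val ds = group.map(_._1)
        (LocalDate.ofEpochDay(ds.min).format(fmt),
         LocalDate.ofEpochDay(ds.max).format(fmt),
         ds.size)
      }
      .toSeq
      .sortBy(r => -r._3) // longest run first
  }
}
```

For the sample dates, the differences are equal for 01/01 and 02/01, and equal (one day later) for 04/01 through 06/01, so `runs(...)` returns two groups and its first element is `("04/01/2020", "06/01/2020", 3)`.
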
Answered 2020-12-14T06:57:27.440