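You can create a window to count how many times a record has occurred in the last 7 days. However, if you try to count occurrences at millisecond granularity, it breaks down.
In short, the expression below, df.timestamp.astype('Timestamp').cast("long"),
only converts the timestamp to a long at one-second precision; it ignores the milliseconds. How do you convert the whole timestamp, milliseconds included, to a long? The value needs to be a long so that it works with the window.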
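To make the truncation concrete, here is a minimal plain-Python sketch (not Spark) using two of the sample timestamps from this question. The names `FMT`, `first`, `second`, `same_second` and `gap_micros` are only for illustration. It shows that the two values tie once the fractional part is dropped, while an integer count of microseconds still tells them apart.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Two sample rows that fall within the same second.
first = datetime.strptime("2018-02-02 05:46:41.438357", FMT)
second = datetime.strptime("2018-02-02 05:46:41.439377", FMT)

# Dropping the fractional part, as a conversion to whole seconds does,
# makes the two values compare equal.
same_second = first.replace(microsecond=0) == second.replace(microsecond=0)

# Keeping the sub-second part as an integer preserves the difference.
gap_micros = (second - first).microseconds

print(same_second)  # True
print(gap_micros)   # 1020
```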
from pyspark.sql import Window
from pyspark.sql import functions as F

# sqlContext is assumed to be provided by the PySpark shell
df = sqlContext.createDataFrame([
    ("a", "u8u", "2018-02-02 05:46:41.438357"),
    ("a", "u8u", "2018-02-02 05:46:41.439377"),
    ("a", "a3a", "2018-02-02 09:48:34.081818"),
    ("a", "a3a", "2018-02-02 09:48:34.095586"),
    ("a", "g8g", "2018-02-02 09:48:56.006206"),
    ("a", "g8g", "2018-02-02 09:48:56.007974"),
    ("a", "9k9", "2018-02-02 12:50:48.000000"),
    ("a", "9k9", "2018-02-02 12:50:48.100000"),
], ["person_id", "session_id", "timestamp"])

# Cast to whole seconds since the epoch; the fractional part is lost here
df = df.withColumn('unix_ts', df.timestamp.astype('Timestamp').cast("long"))
df = df.withColumn("DayOfWeek", F.date_format(df.timestamp, 'EEEE'))

# Range window: the preceding 7 days (in seconds), excluding the current value
w = Window.partitionBy('person_id', 'DayOfWeek').orderBy('unix_ts').rangeBetween(-86400 * 7, -1)
df = df.withColumn('count', F.count('unix_ts').over(w))
df.sort(df.unix_ts).show(20, False)
+---------+----------+--------------------------+----------+---------+-----+
|person_id|session_id|timestamp                 |unix_ts   |DayOfWeek|count|
+---------+----------+--------------------------+----------+---------+-----+
|a        |u8u       |2018-02-02 05:46:41.438357|1517572001|Friday   |0    |
|a        |u8u       |2018-02-02 05:46:41.439377|1517572001|Friday   |0    |
|a        |a3a       |2018-02-02 09:48:34.081818|1517586514|Friday   |2    |
|a        |a3a       |2018-02-02 09:48:34.095586|1517586514|Friday   |2    |
|a        |g8g       |2018-02-02 09:48:56.006206|1517586536|Friday   |4    |
|a        |g8g       |2018-02-02 09:48:56.007974|1517586536|Friday   |4    |
|a        |9k9       |2018-02-02 12:50:48.000000|1517597448|Friday   |6    |
|a        |9k9       |2018-02-02 12:50:48.100000|1517597448|Friday   |6    |
+---------+----------+--------------------------+----------+---------+-----+
The count should be 0,1,2,3,4,5... rather than 0,0,2,2,4,4,...
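The difference between the observed and expected counts can be reproduced outside Spark. The sketch below is a plain-Python emulation of a range window that looks back 7 days and stops one unit before the current key. It is not the Spark implementation: `to_epoch_micros` and `counts_in_preceding_range` are hypothetical helpers, and timestamps are treated as naive values with no time zone, which is an assumption. With whole-second keys the ties produce 0,0,2,2,...; with integer microsecond keys, and the lookback scaled by the same factor, the counts become 0,1,2,3,...

```python
from bisect import bisect_left
from datetime import datetime

EPOCH = datetime(1970, 1, 1)
WEEK_SECONDS = 86_400 * 7

TIMESTAMPS = [
    "2018-02-02 05:46:41.438357",
    "2018-02-02 05:46:41.439377",
    "2018-02-02 09:48:34.081818",
    "2018-02-02 09:48:34.095586",
    "2018-02-02 09:48:56.006206",
    "2018-02-02 09:48:56.007974",
    "2018-02-02 12:50:48.000000",
    "2018-02-02 12:50:48.100000",
]


def to_epoch_micros(ts: str) -> int:
    """Integer microseconds since the epoch (naive timestamps, no time zone)."""
    delta = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f") - EPOCH
    # Integer arithmetic avoids float rounding at microsecond precision.
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def counts_in_preceding_range(keys, lookback):
    """For each key, count the keys x with key - lookback <= x <= key - 1."""
    ordered = sorted(keys)
    return [bisect_left(ordered, k) - bisect_left(ordered, k - lookback) for k in keys]


micros = [to_epoch_micros(t) for t in TIMESTAMPS]
seconds = [m // 1_000_000 for m in micros]  # what a whole-second key keeps

second_counts = counts_in_preceding_range(seconds, WEEK_SECONDS)
micro_counts = counts_in_preceding_range(micros, WEEK_SECONDS * 1_000_000)

print(second_counts)  # [0, 0, 2, 2, 4, 4, 6, 6]
print(micro_counts)   # [0, 1, 2, 3, 4, 5, 6, 7]
```

In PySpark, one option along these lines might be to cast the timestamp to a double (seconds with a fractional part), multiply by 1000 or 1000000, cast the result to long, and scale the rangeBetween bounds by the same factor. That is a suggestion to try, not something verified against this data.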