Generally speaking, there is no efficient way to achieve this with Spark DataFrames, not to mention that things like ordering become really tricky in a distributed setting. In theory you can use the lag function as follows:
from pyspark.sql.functions import lag, col, unix_timestamp
from pyspark.sql.window import Window

# Parse the strings as epoch seconds and convert to a timestamp column
# (the * 1000 is there because the numeric-to-timestamp cast in the Spark
# version this was written against expected milliseconds)
dev_time = (unix_timestamp(col("dev_time")) * 1000).cast("timestamp")

df = sc.parallelize([
    ("2015-09-18 05:00:20", ), ("2015-09-18 05:00:21", ),
    ("2015-09-18 05:00:22", ), ("2015-09-18 05:00:23", ),
    ("2015-09-18 05:00:24", ), ("2015-09-18 05:00:25", ),
    ("2015-09-18 05:00:26", ), ("2015-09-18 05:00:27", ),
    ("2015-09-18 05:00:37", ), ("2015-09-18 05:00:37", ),
    ("2015-09-18 05:00:37", ), ("2015-09-18 05:00:38", ),
    ("2015-09-18 05:00:39", )
]).toDF(["dev_time"]).withColumn("dev_time", dev_time)

# Window over the whole data set ordered by time (no PARTITION BY)
w = Window.orderBy("dev_time")

# Previous row's timestamp, in seconds
lag_dev_time = lag("dev_time").over(w).cast("integer")

# Difference in seconds between each row and the previous one
diff = df.select((col("dev_time").cast("integer") - lag_dev_time).alias("diff"))
## diff.show()
## +----+
## |diff|
## +----+
## |null|
## |   1|
## |   1|
## |   1|
## |   1|
## |   1|
## |   1|
## |   1|
## |  10|
## ...
but it is highly inefficient (for window functions, if no PARTITION BY clause is provided, all the data is moved to a single partition). In practice it makes more sense to use the sliding method on RDDs (Scala) or to implement your own sliding window (Python); a rough sketch of the latter is given below. See:
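For illustration only, here is a minimal sketch (not part of the original answer) of one way to compute consecutive differences without pulling everything through a single-partition window: pair each element with its predecessor by index on an RDD that is assumed to be already sorted by time. The sample epoch-second values are made up, and `sc` is assumed to be an existing SparkContext.

# A minimal sketch, not from the original answer: consecutive differences on an
# RDD by joining each element with its predecessor via zipWithIndex.
# Assumes the RDD is already sorted by time.
rdd = sc.parallelize([1442552420, 1442552421, 1442552422, 1442552437])  # made-up epoch seconds

indexed = rdd.zipWithIndex().map(lambda x: (x[1], x[0]))   # (index, value)
shifted = indexed.map(lambda x: (x[0] + 1, x[1]))          # predecessor keyed by the next index

diffs = (indexed.join(shifted)                              # (index, (current, previous))
                .sortByKey()
                .map(lambda x: x[1][0] - x[1][1]))

# diffs.collect()  -> [1, 1, 15]

Unlike the window-function version, this never funnels the whole data set through one partition, although the join and sortByKey still shuffle data; if a natural grouping key exists (for example a device id), simply adding it as a PARTITION BY clause to the window is the easier fix.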