I have a dataframe like this:
+-------+-----------------------+----------------+
|SEQ_ID |TIME_STAMP             |RESULT          |
+-------+-----------------------+----------------+
|3879859|2021-08-31 19:54:53.88 |25.9485244750994|
|3879859|2021-08-31 21:16:06.228|35.9163284302007|
|3879859|2021-08-31 22:28:46.306|41.9778823852006|
|3879859|2021-08-31 22:28:46.306|41.9778823852006|
|3879859|2021-08-31 23:12:08.058|39.9112701415998|
|3879859|2021-08-31 23:17:35.796|33.0476760864009|
|3879859|2021-08-31 23:47:17.383|60.2846145630007|
|3879859|2021-09-01 00:00:26.722|67.0690536499006|
|3879859|2021-09-01 00:00:26.722|67.0690536499006|
|3879859|2021-09-01 00:02:07.825|67.8424835205007|
For the normal per-group percentile calculation in PySpark I use the following:
from pyspark.sql import functions as f

df.groupBy('SEQ_ID')\
  .agg(f.expr('percentile(RESULT, 0.25)').alias('Q1'),
       f.expr('percentile(RESULT, 0.50)').alias('Median'),
       f.expr('percentile(RESULT, 0.75)').alias('Q3'))
But this aggregates all of the data for each SEQ_ID into a single row. Instead, I want to calculate Q1, Median and Q3 for every row, using only that row and the rows above it:
+-------+-----------------------+----------------+-----+------+-----+
|SEQ_ID |TIME_STAMP             |RESULT          |Q1   |Median|Q3   |
+-------+-----------------------+----------------+-----+------+-----+
|3879859|2021-08-31 19:54:53.88 |25.9485244750994|
|3879859|2021-08-31 21:16:06.228|35.9163284302007|
|3879859|2021-08-31 22:28:46.306|41.9778823852006|
|3879859|2021-08-31 22:28:46.306|41.9778823852006|
|3879859|2021-08-31 23:12:08.058|39.9112701415998|
|3879859|2021-08-31 23:17:35.796|33.0476760864009|
|3879859|2021-08-31 23:47:17.383|60.2846145630007|
|3879859|2021-09-01 00:00:26.722|67.0690536499006|
|3879859|2021-09-01 00:00:26.722|67.0690536499006|
|3879859|2021-09-01 00:02:07.825|67.8424835205007|
So for the first row, Q1, Median and Q3 would all be 25.9485244750994. For the second row, the percentiles would be calculated from 25.9485244750994 and 35.9163284302007, and so on.
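To make the intended semantics concrete, here is a minimal sketch in plain Python using NumPy (whose default linear interpolation should match what Spark's percentile function does); the values list is just the sample RESULT column re-typed by hand, not output produced by Spark:

import numpy as np

# Sample RESULT values from the dataframe above, in TIME_STAMP order.
results = [
    25.9485244750994, 35.9163284302007, 41.9778823852006, 41.9778823852006,
    39.9112701415998, 33.0476760864009, 60.2846145630007, 67.0690536499006,
    67.0690536499006, 67.8424835205007,
]

# For each row, compute the quartiles over that row and all earlier rows.
for i in range(len(results)):
    window = results[: i + 1]
    q1, median, q3 = np.percentile(window, [25, 50, 75])
    print(f"row {i + 1}: Q1={q1:.4f} Median={median:.4f} Q3={q3:.4f}")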
If I define a window like this:
from pyspark.sql import Window

w = Window.partitionBy('SEQ_ID').orderBy(f.col('TIME_STAMP').asc()).rangeBetween(Window.unboundedPreceding, 0)
will the following code work?
df.groupBy('SEQ_ID')\
  .agg(f.expr('percentile(RESULT, 0.25)').alias('Q1'),
       f.expr('percentile(RESULT, 0.50)').alias('Median'),
       f.expr('percentile(RESULT, 0.75)').alias('Q3')).over(w)
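For comparison, here is a minimal sketch of what I believe the window-based version would have to look like: since percentile is an aggregate expression, I assume it would be applied per row with withColumn(...).over(w) rather than through groupBy/agg. This is an assumption about how the expression composes, not something I have verified against the data above:

from pyspark.sql import functions as f

# Running quartiles per row: each percentile is evaluated over the window w,
# i.e. over all rows with the same SEQ_ID up to and including the current row.
df_running = (
    df.withColumn('Q1', f.expr('percentile(RESULT, 0.25)').over(w))
      .withColumn('Median', f.expr('percentile(RESULT, 0.50)').over(w))
      .withColumn('Q3', f.expr('percentile(RESULT, 0.75)').over(w))
)

df_running.show(truncate=False)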