我有一个带日期的数据框,想过滤过去 3 天(不是基于当前时间,而是数据集中可用的最新时间)
+---+----------------------------------------------------------------------------------+----------+
|id |partition |date |
+---+----------------------------------------------------------------------------------+----------+
|1 |/raw/gsec/qradar/flows/dt=2019-12-01/hour=00/1585218406613_flows_20191201_00.jsonl|2019-12-01|
|2 |/raw/gsec/qradar/flows/dt=2019-11-30/hour=00/1585218406613_flows_20191201_00.jsonl|2019-11-30|
|3 |/raw/gsec/qradar/flows/dt=2019-11-29/hour=00/1585218406613_flows_20191201_00.jsonl|2019-11-29|
|4 |/raw/gsec/qradar/flows/dt=2019-11-28/hour=00/1585218406613_flows_20191201_00.jsonl|2019-11-28|
|5 |/raw/gsec/qradar/flows/dt=2019-11-27/hour=00/1585218406613_flows_20191201_00.jsonl|2019-11-27|
+---+----------------------------------------------------------------------------------+----------+
应该返回
+---+----------------------------------------------------------------------------------+----------+
|id |partition |date |
+---+----------------------------------------------------------------------------------+----------+
|1 |/raw/gsec/qradar/flows/dt=2019-12-01/hour=00/1585218406613_flows_20191201_00.jsonl|2019-12-01|
|2 |/raw/gsec/qradar/flows/dt=2019-11-30/hour=00/1585218406613_flows_20191201_00.jsonl|2019-11-30|
|3 |/raw/gsec/qradar/flows/dt=2019-11-29/hour=00/1585218406613_flows_20191201_00.jsonl|2019-11-29|
+---+----------------------------------------------------------------------------------+----------+
编辑:我采用@Lamanus 回答从分区字符串中提取日期
df = sqlContext.createDataFrame([
(1, '/raw/gsec/qradar/flows/dt=2019-12-01/hour=00/1585218406613_flows_20191201_00.jsonl'),
(2, '/raw/gsec/qradar/flows/dt=2019-11-30/hour=00/1585218406613_flows_20191201_00.jsonl'),
(3, '/raw/gsec/qradar/flows/dt=2019-11-29/hour=00/1585218406613_flows_20191201_00.jsonl'),
(4, '/raw/gsec/qradar/flows/dt=2019-11-28/hour=00/1585218406613_flows_20191201_00.jsonl'),
(5, '/raw/gsec/qradar/flows/dt=2019-11-27/hour=00/1585218406613_flows_20191201_00.jsonl')
], ['id','partition'])
df.withColumn('date', F.regexp_extract('partition', '[0-9]{4}-[0-9]{2}-[0-9]{2}', 0)) \
.show(10, False)