python - PySpark: Add a timestamp to a date column and reformat the entire column to the timestamp data type

Question

I have the following sample DataFrame in PySpark. The column is currently of Date data type.

scheduled_date_plus_one
12/2/2018
12/7/2018

I want to reformat the dates and add a timestamp of 2 AM, in 24-hour time. Below is the DataFrame column output I want:

scheduled_date_plus_one
2018-12-02T02:00:00Z
2018-12-07T02:00:00Z

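How can I achieve this? I know how to do it in Python Pandas, but I am not familiar with PySpark.

I know the column I want will be of string data type, because my values contain "T" and "Z". That's fine... I think I already know how to convert a string data type to a timestamp, so I'm all set there.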

score 1 · Accepted Answer

Let's create this PySpark DataFrame for you. You have to import to_date from the functions module -

Step 0: Import these 4 functions -

from pyspark.sql.functions import to_date, date_format, concat, lit

Step 1:

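from pyspark.sql.functions import to_date, date_format, concat, lit
values = [('12/2/2018',),('12/7/2018',)]
# sqlContext is the SQLContext object predefined in the PySpark shell
df = sqlContext.createDataFrame(values,['scheduled_date_plus_one'])
# 'M/d/yyyy' accepts one- or two-digit months and days, such as the single-digit day in 12/2/2018
df = df.withColumn('scheduled_date_plus_one',to_date('scheduled_date_plus_one','M/d/yyyy'))
df.printSchema()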

root
 |-- scheduled_date_plus_one: date (nullable = true)

df.show()
+-----------------------+
|scheduled_date_plus_one|
+-----------------------+
|             2018-12-02|
|             2018-12-07|
+-----------------------+

As we can see from .printSchema(), the column is in date format. So, as our first step, we have created the required DataFrame.

Step 2: Convert scheduled_date_plus_one from date format to string format, so that we can concatenate T02:00:00Z to it. date_format converts a date to a string in the desired format. We use yyyy-MM-dd.

df = df.withColumn('scheduled_date_plus_one',date_format('scheduled_date_plus_one',"yyyy-MM-dd"))
df.printSchema()
root
 |-- scheduled_date_plus_one: string (nullable = true)

df.show()
+-----------------------+
|scheduled_date_plus_one|
+-----------------------+
|             2018-12-02|
|             2018-12-07|
+-----------------------+

The .printSchema() output above shows that scheduled_date_plus_one has been converted to string format, so now we can do the concatenation part.

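Step 3: Concatenation - for this we use the concat function. Note - you must wrap T02:00:00Z in the lit() function, because we are not concatenating two columns; a bare string passed to concat is treated as a column name.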

df = df.withColumn('scheduled_date_plus_one',concat('scheduled_date_plus_one',lit('T02:00:00Z')))
df.show()
+-----------------------+
|scheduled_date_plus_one|
+-----------------------+
|   2018-12-02T02:00:00Z|
|   2018-12-07T02:00:00Z|
+-----------------------+
