dataframe - Combine date string and time string

Question

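How would you do this? Databricks 4.1, Spark 2.3

You are given a two-column dataframe: 1) 'dt', a string, as shown. 2) 'tm', a string, as shown. I added the third column for this post.

Your job is to create column 3, 'dtm', a timestamp. Format, leading zeros, precision and time zone matter less than correctly combining 'dt' and 'tm'.

I used PySpark for this post, but I'm not married to it.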

df1 = sqlContext.createDataFrame(
  [
     ('2018-06-02T00:00:00','12:30:00 AM', '06-02-2018 00:30:00.000+0000')
    ,('2018-11-15T00:00:00','03:00:00 AM', '11-15-2018 03:00:00.000+0000')
    ,('2018-06-02T00:00:00','10:30:00 AM', '06-02-2018 10:30:00.000+0000')
    ,('2018-06-02T00:00:00','12:30:00 PM', '06-02-2018 12:30:00.000+0000')
    ,('2018-11-15T00:00:00','03:00:00 PM', '11-15-2018 15:00:00.000+0000')
    ,('2018-06-02T00:00:00','10:30:00 PM', '06-02-2018 22:30:00.000+0000')
  ]
  ,['dt', 'tm', 'desiredCalculatedResult']
)

I've gone through dozens and dozens of examples and attempts, and so far I haven't found a solution that works end to end.

score 7 · Accepted Answer

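You can split the date string at the "T" to extract only the date part, then combine it with the time string to get a string representing the actual timestamp you want to create. Then parse it with a matching format pattern and cast the result to a timestamp.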

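from pyspark.sql.functions import concat, split, lit, from_unixtime, unix_timestamp

# Date part of dt (everything before the "T"), a space, then the 12-hour time string
dt_tm = concat(split(df1.dt, "T")[0], lit(" "), df1.tm)
# 'hh' is the 12-hour clock and 'a' is the AM/PM marker
df1 = df1.withColumn("dttm", from_unixtime(unix_timestamp(dt_tm, 'yyyy-MM-dd hh:mm:ss a')).cast("timestamp"))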
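
To sanity-check the split-and-parse logic outside Spark, here is a minimal plain-Python sketch of the same idea. It is not Spark code: `combine` is a hypothetical helper, and `%I` / `%p` are Python's counterparts of the `hh` / `a` pattern letters used above. It assumes the default C/English locale for the AM/PM marker.

```python
from datetime import datetime

def combine(dt: str, tm: str) -> datetime:
    """Join the date part of dt with a 12-hour time string tm."""
    date_part = dt.split("T")[0]  # keep only yyyy-MM-dd
    # %I is the 12-hour clock, %p is the AM/PM marker
    return datetime.strptime(f"{date_part} {tm}", "%Y-%m-%d %I:%M:%S %p")

# The (dt, tm) pairs from the question
rows = [
    ("2018-06-02T00:00:00", "12:30:00 AM"),
    ("2018-11-15T00:00:00", "03:00:00 AM"),
    ("2018-06-02T00:00:00", "10:30:00 AM"),
    ("2018-06-02T00:00:00", "12:30:00 PM"),
    ("2018-11-15T00:00:00", "03:00:00 PM"),
    ("2018-06-02T00:00:00", "10:30:00 PM"),
]

results = [combine(dt, tm) for dt, tm in rows]
for (dt, tm), parsed in zip(rows, results):
    print(dt, tm, "->", parsed.strftime("%m-%d-%Y %H:%M:%S"))
```

The edge cases worth checking are 12:30:00 AM and 12:30:00 PM, which should land on hours 0 and 12 respectively. A 24-hour pattern (`HH` in Spark, `%H` in Python) would not apply the AM/PM marker that way.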

score 0 · Accepted Answer

Take a look at the built-in functions.

You will want to look at:

date_format
to_timestamp
unix_timestamp
from_utc_timestamp

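A combination of these or other DateTime functions will get you where you want to go. Spark 2.x has strong support for manipulating datetimes; however, if you really can't get it done with the built-in functions, you can always fall back to the Joda Time Java package.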
