scala - Scala: Parsing a timestamp with Spark 3.1.2

Question

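I have an Excel reader whose results I put into Spark DataFrames. I am having trouble parsing the timestamps.

I have timestamps as strings, like Wed Dec 08 10:49:59 CET 2021. I was using spark-sql version 2.4.5 and everything worked fine until I recently updated to version 3.1.2.

Please find some minimal code below.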

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}

// Session setup and implicits are needed for .toDF on a Seq
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ts: String = "Wed Dec 08 20:49:59 CET 2021"
val oldfmt: String = "E MMM dd HH:mm:ss z yyyy"

val ttdf = Seq(ts)
  .toDF("theTimestampColumn")
  .withColumn("parsedTime", to_timestamp(col("theTimestampColumn"), fmt = oldfmt))

ttdf.show()

Running this code with Spark version 2.4.5 works as expected and produces the following output:

+--------------------+-------------------+
|  theTimestampColumn|         parsedTime|
+--------------------+-------------------+
|Wed Dec 08 20:49:...|2021-12-08 20:49:59|
+--------------------+-------------------+

Now, executing the same code with only the Spark version changed to 3.1.2 results in the following error:

Exception in thread "main" org.apache.spark.SparkUpgradeException: 
You may get a different result due to the upgrading of Spark 3.0: 
Fail to recognize 'E MMM dd HH:mm:ss z yyyy' pattern in the DateTimeFormatter. 
1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 
2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

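(clickable link: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html)

This website did not help me any further. I cannot find any mistake in my format string. The symbol E represents day-of-week as text, like Tue; Tuesday. The symbol M represents month-of-year, like 7; 07; Jul; July. The symbols H, m, s, y are hour, minute, second, and year, respectively. The symbol z denotes the time-zone name, like Pacific Standard Time; PST. Am I missing something obvious here?

Any help would be greatly appreciated. Thank you in advance.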

score 1 · Accepted Answer

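As stated in the datetime pattern documentation, the symbol E can only be used for datetime formatting, not for parsing:

Symbols of 'E', 'F', 'q' and 'Q' can only be used for datetime formatting, e.g. date_format. They are not allowed to be used for datetime parsing, e.g. to_timestamp.

If you want the behavior of Spark versions < 3.0, you can set the spark.sql.legacy.timeParserPolicy option to LEGACY: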
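To illustrate the distinction the documentation draws, here is a minimal sketch showing that E is accepted when formatting an existing timestamp with date_format. The column names raw, ts, and pretty, and the local session setup, are assumptions made for this example and do not come from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format, to_timestamp}

// Local session for illustration only (assumption); in spark-shell a session already exists.
val spark = SparkSession.builder().master("local[*]").appName("format-vs-parse").getOrCreate()
import spark.implicits._

val formattedDf = Seq("2021-12-08 20:49:59")
  .toDF("raw")
  // Parsing with the default timestamp format; no 'E' is involved here.
  .withColumn("ts", to_timestamp(col("raw")))
  // Formatting: 'E' (day-of-week as text) is allowed in date_format.
  .withColumn("pretty", date_format(col("ts"), "E MMM dd HH:mm:ss yyyy"))

formattedDf.show(truncate = false)
```

The same pattern letter passed to to_timestamp is what triggers the SparkUpgradeException shown in the question.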

sparkSession.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
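As a variation, the same option can be supplied when the session is built, so that it is in place before any parsing happens. This is only a sketch: the application name, the local master, and the value name legacyDf are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}

// Assumed local setup; the config key is the one named in the error message.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("legacy-time-parser")
  .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
  .getOrCreate()
import spark.implicits._

// With the legacy policy, the original pattern including 'E' is accepted again.
val legacyDf = Seq("Wed Dec 08 20:49:59 CET 2021")
  .toDF("theTimestampColumn")
  .withColumn("parsedTime", to_timestamp(col("theTimestampColumn"), "E MMM dd HH:mm:ss z yyyy"))

legacyDf.show(truncate = false)
```

Note that the displayed value of parsedTime depends on the session time zone, so it may differ from the input's wall-clock time on machines that are not in CET.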

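If you do not want to change the Spark configuration, you can use the substr SQL function to remove the characters that represent the day of the week: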

import org.apache.spark.sql.functions.{col, to_timestamp, expr}

val ts: String = "Wed Dec 08 20:49:59 CET 2021"
val fmt: String = "MMM dd HH:mm:ss z yyyy"

val ttdf = Seq(ts)
  .toDF("theTimestampColumn")
  // substr positions are 1-based: start at 5 to skip the 4-character prefix "Wed "
  .withColumn("preparedTimestamp", expr("substr(theTimestampColumn, 5, length(theTimestampColumn))"))
  .withColumn("parsedTime", to_timestamp(col("preparedTimestamp"), fmt = fmt))
  .drop("preparedTimestamp")
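
The substr approach relies on the day-of-week prefix always being exactly four characters long (three letters plus a space). If that is not guaranteed, one possible alternative is to strip the first word with regexp_replace instead. This is a hedged sketch, not part of the original answer: the regular expression and the value name cleanedDf are assumptions, and it expects the day name to be the first whitespace-delimited token.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace, to_timestamp}

// Assumed local setup for illustration; in spark-shell a session already exists.
val spark = SparkSession.builder().master("local[*]").appName("strip-day-of-week").getOrCreate()
import spark.implicits._

val cleanedDf = Seq("Wed Dec 08 20:49:59 CET 2021")
  .toDF("theTimestampColumn")
  .withColumn(
    "parsedTime",
    to_timestamp(
      // Remove the leading word and the whitespace after it, e.g. "Wed ".
      regexp_replace(col("theTimestampColumn"), "^\\w+\\s+", ""),
      "MMM dd HH:mm:ss z yyyy"
    )
  )

cleanedDf.show(truncate = false)
```

This avoids the intermediate column and the drop, at the cost of a regular expression evaluation per row.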
