pyspark - from_json returns null in Apache Spark 3.0

Question

I have a string-typed PySpark column that contains an array of dictionaries.

     x = {"a":1,"b":[{"type":"abc","unitValue":"4.4"}]}

I want to convert the string into an array of structs, but when I do, the fields in the new column are populated with null.

Databricks Runtime 8.3 (includes Apache Spark 3.1.1, Scala 2.12)

My DataFrame looks like this:

     from pyspark.sql.functions import *
     from pyspark.sql.types import *

     inputSchema = StructType([StructField("a",StringType(),True),
                         StructField("b",StringType(),True)])

     jsonStruct = StructType([StructField("type",StringType(),True),
                         StructField("unitValue",StringType(),True)])
     
     df = spark.createDataFrame(data =[x],schema = inputSchema)
     df.show()

     +---+---------------------------+
     |  a|                   b       |
     +---+---------------------------+
     |  1|[{type=abc, unitValue=4.4}]|
     +---+---------------------------+

     df.printSchema()
     root
     |-- a: string (nullable = true)
     |-- b: string (nullable = true)

I am using the from_json function for this, but the values are populated with null:

     df1 = df.withColumn("newvalue",from_json(col("b"),jsonStruct,{"mode" : "PERMISSIVE"}))
     display(df1)
     
     +---+----------------------------+----------------------------------+
     |  a|                   b        |    newvalue                      |
     +---+----------------------------+----------------------------------+
     |  1|[{type=abc, unitValue=xyz}] |{"type": null, "unitValue": null} |
     +---+----------------------------+----------------------------------+

Can anyone help me here?

score 1 · Accepted Answer

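The JSON structure in column b is incorrect. When the DataFrame is created, each : is replaced with =, so the string in b is no longer valid JSON.

You must either declare b as a string when defining the variable itself, or use regexp_replace() to replace = with : afterwards: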

     x = {"a":1,"b":'[{"type":"abc","unitValue":"4.4"}]'}
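If the nested value already exists as Python objects, writing the JSON string by hand is not the only way to do this. As a minimal sketch (my own illustration, not part of the original answer), the standard-library json.dumps can serialize it before the row is passed to createDataFrame. The name row is hypothetical.

```python
import json

# The original nested value: "b" is a list of dicts, not a string.
x = {"a": 1, "b": [{"type": "abc", "unitValue": "4.4"}]}

# Serialize "b" to a real JSON string so that a StringType column
# holds '[{"type": ...}]' rather than a '{type=abc, ...}' rendering.
row = {"a": str(x["a"]), "b": json.dumps(x["b"])}

print(row["b"])

# Sanity check: the string round-trips back to the original structure.
assert json.loads(row["b"]) == x["b"]
```

A row built this way can then be used in place of x when creating the DataFrame, assuming both columns are declared as StringType as in the question.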

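You also need to change the JSON schema, as shown below, because b holds an array of structs rather than a single struct.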

     jsonStruct = ArrayType(StructType([
         StructField("type",StringType(),True),
         StructField("unitValue",StringType(),True)]),True)

score 0 · Accepted Answer

This issue is specific to Spark 3.0.0 and later. Databricks KB link: https://kb.databricks.com/scala/from-json-null-spark3.html. I also found a solution.

     inputSchema = StructType([StructField("a",StringType(),True),StructField("b",StringType(),True)])

     jsonStruct = ArrayType(StructType([StructField("type",StringType(),True),StructField("unitValue",StringType(),True)]),True)

     x = {"a":1,"b":'[{"type":"abc","unitValue":"xyz"}]'}

     df = spark.createDataFrame(data =[x],schema = inputSchema)

     df = df.withColumn("b",regexp_replace('b', '=', ':').cast(StringType()))

     df1 = df.withColumn("newvalue",from_json(col("b"),jsonStruct,{"mode" : "PERMISSIVE"}))

     display(df1)

     +---+----------------------------+----------------------------------+
     |  a|                   b        |    newvalue                      |
     +---+----------------------------+----------------------------------+
     |  1|[{type=abc, unitValue=xyz}] |{"type": "abc", "unitValue": "xyz"} |
     +---+----------------------------+----------------------------------+
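One caveat when the column already contains the map-style text (for example `[{type=abc, unitValue=xyz}]`): swapping = for : alone leaves keys and values unquoted, so the result is still not strict JSON. The sketch below is a plain-Python illustration of a replacement that also adds the quotes. The helper name map_string_to_json is hypothetical, and it assumes that keys are simple words, that values contain no commas, braces, brackets, or quotes, and that every value should be treated as a string.

```python
import json
import re


def map_string_to_json(text):
    """Turn '[{type=abc, unitValue=4.4}]' into a JSON array string.

    Assumes simple word keys and values without commas, braces,
    brackets, or quotes; every value is emitted as a JSON string.
    """
    # key=value  ->  "key": "value"
    return re.sub(r'(\w+)=([^,}\]]+)', r'"\1": "\2"', text)


converted = map_string_to_json('[{type=abc, unitValue=4.4}]')
print(converted)

# The converted text now parses as JSON.
parsed = json.loads(converted)
```

A similar pattern could probably be passed to regexp_replace in Spark, which uses Java-style `$1` and `$2` group references in the replacement string, but I have not tested that here.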
