
I have a simple program whose dataset has a column resource_serialized containing JSON strings as values, like this:

import org.apache.spark.SparkConf

object TestApp {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("Loading Data").setMaster("local[*]")

    val spark = org.apache.spark.sql.SparkSession
      .builder
      .config(sparkConf)
      .appName("Test")
      .getOrCreate()

    val json = "[{\"resource_serialized\":\"{\\\"createdOn\\\":\\\"2000-07-20 00:00:00.0\\\",\\\"genderCode\\\":\\\"0\\\"}\",\"id\":\"00529e54-0f3d-4c76-9d3\"}]"

    import spark.implicits._
    val df = spark.read.json(Seq(json).toDS)
    df.printSchema()
    df.show()
  }
}

The printed schema is:

root
 |-- id: string (nullable = true)
 |-- resource_serialized: string (nullable = true)

The dataset printed to the console is:

+--------------------+--------------------+
|                  id| resource_serialized|
+--------------------+--------------------+
|00529e54-0f3d-4c7...|{"createdOn":"200...|
+--------------------+--------------------+

The resource_serialized field holds a JSON string, i.e. (from the debug console):

{"createdOn":"2000-07-20 00:00:00.0","genderCode":"0"}

Now I need to create a Dataset/DataFrame from that JSON string. How can I achieve this?

My goal is to get a dataset like this:

+--------------------+--------------------+----------+
|                  id|           createdOn|genderCode|
+--------------------+--------------------+----------+
|00529e54-0f3d-4c7...|2000-07-20 00:00    |         0|
+--------------------+--------------------+----------+

2 Answers


Use the from_json function to parse the JSON string column into struct columns.

Example:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val sch = new StructType().add("createdOn", StringType).add("genderCode", StringType)

df.select(col("id"), from_json(col("resource_serialized"), sch).alias("str"))
  .select("id", "str.*")
  .show(10, false)

//result
//+----------------------+---------------------+----------+
//|id                    |createdOn            |genderCode|
//+----------------------+---------------------+----------+
//|00529e54-0f3d-4c76-9d3|2000-07-20 00:00:00.0|0         |
//+----------------------+---------------------+----------+
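
Since the target output renders createdOn as a timestamp, you may additionally want to cast the parsed string to a real TimestampType column. A minimal sketch (my addition, not part of the original answer), assuming Spark's string-to-timestamp cast accepts the "yyyy-MM-dd HH:mm:ss.S" format used here:

// optional: cast createdOn from string to a proper timestamp
df.select(col("id"), from_json(col("resource_serialized"), sch).alias("str"))
  .select(col("id"), col("str.createdOn").cast("timestamp").alias("createdOn"), col("str.genderCode"))
  .show(false)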

If you have valid JSON to begin with (i.e. resource_serialized is a nested object rather than an escaped string), you can read it directly with spark.read.json:

val json = """[{"resource_serialized":{"createdOn":"2000-07-20 00:00:00.0","genderCode":"0"},"id":"00529e54-0f3d-4c76-9d3"}]"""

val sch=new StructType().
add("id",StringType).
add("resource_serialized", new StructType().add("createdOn",StringType).
add("genderCode",StringType))

spark.read.option("multiline","true").
schema(sch).
json(Seq(json).toDS).
select("id","resource_serialized.*").
show()
//+--------------------+--------------------+----------+
//|                  id|           createdOn|genderCode|
//+--------------------+--------------------+----------+
//|00529e54-0f3d-4c7...|2000-07-20 00:00:...|         0|
//+--------------------+--------------------+----------+
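
As a side note (my addition, not part of the original answer): on Spark 2.4+ you can let Spark infer the inner schema for the from_json approach above with schema_of_json, instead of writing the StructType by hand. A sketch, assuming one sample record is representative of every row:

// infer the schema of the embedded JSON from a sample string (Spark 2.4+)
val sample = """{"createdOn":"2000-07-20 00:00:00.0","genderCode":"0"}"""

df.select(col("id"), from_json(col("resource_serialized"), schema_of_json(lit(sample))).alias("str"))
  .select("id", "str.*")
  .show(false)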
Answered 2020-05-16T17:13:02.557

The solution below maps all the key/value pairs in resource_serialized into a Map(String, String) column, so the keys can be turned into columns afterwards without declaring a schema up front.

import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}

object TestApp {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("Loading Data").setMaster("local[*]")

    val spark = org.apache.spark.sql.SparkSession
      .builder
      .config(sparkConf)
      .appName("Test")
      .getOrCreate()

    val json = "[{\"resource_serialized\":\"{\\\"createdOn\\\":\\\"2000-07-20 00:00:00.0\\\",\\\"genderCode\\\":\\\"0\\\"}\",\"id\":\"00529e54-0f3d-4c76-9d3\"}]"

    import spark.implicits._
    val df = spark.read.json(Seq(json).toDS)

    // parse the JSON string into a Map[String, String] column
    val jsonColumn = from_json($"resource_serialized", MapType(StringType, StringType))

    // collect the distinct keys on the driver so each key can become its own column
    val keysDF = df.select(explode(map_keys(jsonColumn))).distinct()
    val keys = keysDF.collect().map(f => f.get(0))
    val keyCols = keys.map(f => jsonColumn.getItem(f).as(f.toString))

    df.select($"id" +: keyCols: _*).show(false)
  }
}


The output looks like:

+----------------------+---------------------+----------+
|id                    |createdOn            |genderCode|
+----------------------+---------------------+----------+
|00529e54-0f3d-4c76-9d3|2000-07-20 00:00:00.0|0         |
+----------------------+---------------------+----------+
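
If you already know the keys, a simpler alternative (my addition, not part of the original answer) is get_json_object, which extracts individual fields by JSON path without building a map or collecting keys to the driver:

// extract known fields directly by JSON path
df.select(
  $"id",
  get_json_object($"resource_serialized", "$.createdOn").alias("createdOn"),
  get_json_object($"resource_serialized", "$.genderCode").alias("genderCode")
).show(false)

The trade-off is that each get_json_object call may re-parse the JSON string, whereas the map approach parses it once per row.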
Answered 2020-05-16T17:30:33.257