我有一个简单的程序,它的数据集的列resource_serialized具有 JSON 字符串作为值,如下所示:
import org.apache.spark.SparkConf
object TestApp {
def main(args: Array[String]): Unit = {
val sparkConf: SparkConf = new SparkConf().setAppName("Loading Data").setMaster("local[*]")
val spark = org.apache.spark.sql.SparkSession
.builder
.config(sparkConf)
.appName("Test")
.getOrCreate()
val json = "[{\"resource_serialized\":\"{\\\"createdOn\\\":\\\"2000-07-20 00:00:00.0\\\",\\\"genderCode\\\":\\\"0\\\"}\",\"id\":\"00529e54-0f3d-4c76-9d3\"}]"
import spark.implicits._
val df = spark.read.json(Seq(json).toDS)
df.printSchema()
df.show()
}
}
打印的架构是:
root
|-- id: string (nullable = true)
|-- resource_serialized: string (nullable = true)
打印在控制台上的数据集是:
+--------------------+--------------------+
| id| resource_serialized|
+--------------------+--------------------+
|00529e54-0f3d-4c7...|{"createdOn":"200...|
+--------------------+--------------------+
该resource_serialized字段具有 json 字符串,即(来自调试控制台)
现在,我需要用那个 json 字符串创建数据集/数据框,我该如何实现呢?
我的目标是得到这样的数据集:
+--------------------+--------------------+----------+
| id| createdOn|genderCode|
+--------------------+--------------------+----------+
|00529e54-0f3d-4c7...|2000-07-20 00:00 | 0|
+--------------------+--------------------+----------+
