I have this kind of JSON data:
{
"data": [
{
"id": "4619623",
"team": "452144",
"created_on": "2018-10-09 02:55:51",
"links": {
"edit": "https://some_page",
"publish": "https://some_publish",
"default": "https://some_default"
}
},
{
"id": "4619600",
"team": "452144",
"created_on": "2018-10-09 02:42:25",
"links": {
"edit": "https://some_page",
"publish": "https://some_publish",
"default": "https://some_default"
}
}
]
}
I read this data using Apache Spark and want to write it out partitioned by the id column. When I use this:
df.write.partitionBy("data.id").json(<path_to_folder>)
I get this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Partition column data.id not found in schema
I also tried using the explode function like this:
import org.apache.spark.sql.functions.{col, explode}
val renamedDf = df.withColumn("id", explode(col("data.id")))
renamedDf.write.partitionBy("id").json(<path_to_folder>)
That actually helped, but each id partition folder contained the same original JSON file.
EDIT: schema of df DataFrame:
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- created_on: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- links: struct (nullable = true)
| | | |-- default: string (nullable = true)
| | | |-- edit: string (nullable = true)
| | | |-- publish: string (nullable = true)
Schema of renamedDf DataFrame:
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- created_on: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- links: struct (nullable = true)
| | | |-- default: string (nullable = true)
| | | |-- edit: string (nullable = true)
| | | |-- publish: string (nullable = true)
|-- id: string (nullable = true)
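The renamedDf schema above still carries the whole data array next to the new id column, which would explain why every partition folder repeats the original content. A minimal sketch of one possible way around that, assuming it is acceptable to explode the data structs themselves and promote their fields to top-level columns (flattenData is a hypothetical helper name, and I have not verified this on 2.1.0):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical helper: turn each element of the `data` array into its own row,
// then keep only that element's fields so the original array is dropped.
def flattenData(df: DataFrame): DataFrame =
  df.select(explode(col("data")).as("item")) // one row per array element
    .select("item.*")                        // element fields become top-level columns

// Possible usage with the df from above (output path left as a placeholder):
// flattenData(df).write.partitionBy("id").json(<path_to_folder>)
```

With this shape, id is a real top-level column, so partitionBy("id") can find it, and each row holds only one element of the array rather than the full array.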
I am using Spark 2.1.0.
I found this solution: DataFrame partitionBy on nested columns
And this example: http://bigdatums.net/2016/02/12/how-to-extract-nested-json-data-in-spark/
But neither of them helped me solve my problem.
Thanks in advance for any help.