I am trying to write a DataFrame that contains a struct column to Elasticsearch:
from pyspark.sql.functions import col, struct, to_json

df1 = spark.createDataFrame([{"date": "2020.04.10", "approach": "test", "outlier_score": 1, "a": "1", "b": 2},
                             {"date": "2020.04.10", "approach": "test", "outlier_score": 0, "a": "2", "b": 1}],
                            )
# Pack columns a and b into a struct, then serialize that struct to a JSON string
df1 = df1.withColumn('details', to_json(struct(
    col('a'),
    col('b')
)))
df1.show(truncate=False)
# Append the selected columns to the 'outliers' index
df1.select('date', 'approach', 'outlier_score', 'details').write.format("org.elasticsearch.spark.sql").option('es.resource', 'outliers').save(mode="append")
This results in:
+---+--------+---+----------+-------------+---------------+
|a  |approach|b  |date      |outlier_score|details        |
+---+--------+---+----------+-------------+---------------+
|1  |test    |2  |2020.04.10|1            |{"a":"1","b":2}|
|2  |test    |1  |2020.04.10|0            |{"a":"2","b":1}|
+---+--------+---+----------+-------------+---------------+
This does work, but the JSON gets escaped, so the corresponding details field is not clickable in Kibana:
{
  "_index": "outliers",
  "_type": "_doc",
  "_id": "NuDSA3IBhHa_VjuWENYR",
  "_version": 1,
  "_score": 0,
  "_source": {
    "date": "2020.04.10",
    "approach": "test",
    "outlier_score": 1,
    "details": "{\"a\":\"1\",\"b\":2}"
  },
  "highlight": {
    "date": [
      "@kibana-highlighted-field@2020.04.10@/kibana-highlighted-field@"
    ]
  }
}
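To make the escaping concrete, here is a minimal plain-Python sketch (no Spark or Elasticsearch involved) of what I believe is happening: once details is already a string, serializing the enclosing document escapes its quotes, whereas a nested object is embedded as-is. The variable names are only for illustration.

```python
import json

# The same nested value, once as an object and once pre-serialized to a string
# (the string form is what to_json produces for the details column).
details = {"a": "1", "b": 2}
details_as_string = json.dumps(details, separators=(",", ":"))

doc_with_string = {"approach": "test", "details": details_as_string}
doc_with_object = {"approach": "test", "details": details}

# Serializing the whole document escapes the quotes inside the string value...
escaped = json.dumps(doc_with_string, separators=(",", ":"))
# ...but leaves a nested object as real JSON.
nested = json.dumps(doc_with_object, separators=(",", ":"))

print(escaped)
print(nested)
```

The first printed line has the same escaped shape as the details value in _source above; the second is the shape I would like to end up with in the index.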
I tried supplying .option("es.input.json", "true"), but got an exception:
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: mapper_parsing_exception: failed to parse;org.elasticsearch.hadoop.rest.EsHadoopRemoteException: not_x_content_exception: Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes
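My understanding, which may be wrong, is that es.input.json tells the connector that each record it receives is already a complete JSON document, whereas in my case only the details column is JSON. A minimal plain-Python sketch of what one whole-document JSON string per row would look like; row_to_json_doc is a hypothetical helper, and how such strings would be handed to the connector is not shown here.

```python
import json

# The same two rows as in the DataFrame above, as plain dicts.
rows = [
    {"date": "2020.04.10", "approach": "test", "outlier_score": 1, "a": "1", "b": 2},
    {"date": "2020.04.10", "approach": "test", "outlier_score": 0, "a": "2", "b": 1},
]


def row_to_json_doc(row):
    """Hypothetical helper: serialize one row as a single, complete JSON document."""
    doc = {key: row[key] for key in ("date", "approach", "outlier_score")}
    # Keep details as a nested object rather than a pre-serialized string.
    doc["details"] = {"a": row["a"], "b": row["b"]}
    return json.dumps(doc, separators=(",", ":"))


docs = [row_to_json_doc(row) for row in rows]
for doc in docs:
    print(doc)
```

Each resulting string is a full document with details nested inside it, rather than a row in which a single column happens to hold JSON.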
If I instead try to write the data without converting it to JSON, i.e. removing to_json( from the original code, I get a different exception:
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: mapper_parsing_exception: failed to parse field [details] of type [text] in document with id 'TuDWA3IBhHa_VjuWFNmX'. Preview of field's value: '{a=2, b=1}';org.elasticsearch.hadoop.rest.EsHadoopRemoteException: illegal_state_exception: Can't get text on a START_OBJECT at 1:68
{"index":{}}
{"date":"2020.04.10","approach":"test","outlier_score":0,"details":{"a":"2","b":1}}
So the question is: how can I write a PySpark DataFrame with a nested JSON column to Elasticsearch so that the JSON does not get escaped?