
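I am trying to parse a column that holds a list of JSON strings, but even after trying several schemas built with StructType, StructField, etc., I cannot get the schema to work at all.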

[{"event":"empCreation","count":"148"},{"event":"jobAssignment","count":"3"},{"event":"locationAssignment","count":"77"}]

[{"event":"empCreation","count":"334"},{"event":"jobAssignment","count":"33"},{"event":"locationAssignment","count":"73"}]

[{"event":"empCreation","count":"18"},{"event":"jobAssignment","count":"32"},{"event":"locationAssignment","count":"72"}]

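Based on this SO post, I was able to derive the JSON schema, but it still does not work even after applying the from_json function:

Pyspark: Parse a column of json strings

Can you help?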


2 Answers


You can parse the given JSON with the schema definition below, reading the JSON file as a DataFrame and supplying the schema explicitly.

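>>> from pyspark.sql.types import StructType, StructField, StringType
>>>
>>> # Each object inside the JSON arrays has two string fields.
>>> dschema = StructType([
...         StructField("event", StringType(), True),
...         StructField("count", StringType(), True)])
>>>
>>> df = spark.read.json('/<json_file_path>/json_file.json', schema=dschema)
>>>
>>> df.show()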
+------------------+-----+
|             event|count|
+------------------+-----+
|       empCreation|  148|
|     jobAssignment|    3|
|locationAssignment|   77|
|       empCreation|  334|
|     jobAssignment|   33|
|locationAssignment|   73|
|       empCreation|   18|
|     jobAssignment|   32|
|locationAssignment|   72|
+------------------+-----+


Contents of the JSON file:

cat json_file.json
[{"event":"empCreation","count":"148"},{"event":"jobAssignment","count":"3"},{"event":"locationAssignment","count":"77"}]
[{"event":"empCreation","count":"334"},{"event":"jobAssignment","count":"33"},{"event":"locationAssignment","count":"73"}]
[{"event":"empCreation","count":"18"},{"event":"jobAssignment","count":"32"},{"event":"locationAssignment","count":"72"}]
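
As a sanity check outside Spark, here is a minimal standard-library sketch. The `lines` list is just a copy of the file contents above, and `records` is a name chosen for illustration. It shows that every line decodes to a list of objects, so the three lines hold nine records in total, which matches the nine rows in the output.

```python
import json

# Copy of the three lines in json_file.json shown above.
lines = [
    '[{"event":"empCreation","count":"148"},{"event":"jobAssignment","count":"3"},{"event":"locationAssignment","count":"77"}]',
    '[{"event":"empCreation","count":"334"},{"event":"jobAssignment","count":"33"},{"event":"locationAssignment","count":"73"}]',
    '[{"event":"empCreation","count":"18"},{"event":"jobAssignment","count":"32"},{"event":"locationAssignment","count":"72"}]',
]

# Each line is a top-level JSON array, so json.loads returns a list of dicts.
records = [obj for line in lines for obj in json.loads(line)]

for rec in records:
    print(rec["event"], rec["count"])
```

Because the top-level value on each line is an array rather than a single object, a schema meant for a string column holding this text would need to describe an array of structs.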
answered 2018-12-07T08:31:59.260

Thanks a lot @Lakshmanan, but I only needed to make one small change to the schema:

eventCountSchema = ArrayType(StructType([StructField("event", StringType(), True), StructField("count", StringType(), True)]), True)

Then I applied this schema to the DataFrame's complex-type column.

answered 2018-12-07T22:29:02.983