A coworker sent me an Elasticsearch query result (100,000 records, hundreds of attributes) that looks like this:
pets_json <- paste0('[{"animal":"cat","attributes":{"intelligence":"medium","noises":[{"noise":"meow","code":4},{"noise":"hiss","code":2}]}},',
'{"animal":"dog","attributes":{"intelligence":"high","noises":{"noise":"bark","code":1}}},',
'{"animal":"snake","attributes":{"intelligence":"low","noises":{"noise":"hiss","code":2}}}]')
There is a redundant key, "code", that I don't need to capture.
I'd like to produce a data.frame that looks something like this:
animal intelligence noises.bark noises.hiss noises.meow
cat medium 0 1 1
dog high 1 0 0
snake low 0 1 0
I can read in the JSON, but flatten=TRUE doesn't flatten it completely:
library(jsonlite)
str(df <- fromJSON(txt=pets_json, flatten=TRUE))
# 'data.frame': 3 obs. of 3 variables:
# $ animal : chr "cat" "dog" "snake"
# $ attributes.intelligence: chr "medium" "high" "low"
# $ attributes.noises :List of 3
# ..$ :'data.frame': 2 obs. of 2 variables: \
# .. ..$ noise : chr "meow" "hiss" \
# .. ..$ code: int 4 2 |
# ..$ :List of 2 |
# .. ..$ noise : chr "bark" |- need to remove code and flatten
# .. ..$ code: int 1 |
# ..$ :List of 2 |
# .. ..$ noise : chr "hiss" /
# .. ..$ code: int 2 /
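Because the flattening is incomplete, I can use this intermediate stage to get rid of the unwanted "code" keys before calling flatten() again, but the only way I know to remove the keys is really slow: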
for( l in which(sapply(df, is.list)) ){
for( l2 in which(sapply(df[[l]], is.list))){
df[[l]][[l2]]['code'] <- NULL
}
}
( df <- data.frame(flatten(df)) )
# animal attributes.intelligence attributes.noises
# 1 cat medium meow, hiss
# 2 dog high bark
# 3 snake low hiss
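The same key removal can also be written with lapply() instead of indexing back into df on every iteration. This is only a sketch: drop_key is a hypothetical helper name, the JSON is the three-record sample from above, and I have not measured whether it is actually faster on 100,000 records.

```r
library(jsonlite)

# Same three-record sample as above, repeated so the snippet runs on its own.
pets_json <- paste0(
  '[{"animal":"cat","attributes":{"intelligence":"medium","noises":[{"noise":"meow","code":4},{"noise":"hiss","code":2}]}},',
  '{"animal":"dog","attributes":{"intelligence":"high","noises":{"noise":"bark","code":1}}},',
  '{"animal":"snake","attributes":{"intelligence":"low","noises":{"noise":"hiss","code":2}}}]'
)
df <- fromJSON(txt = pets_json, flatten = TRUE)

# Hypothetical helper: remove one named key from a list or a data.frame.
drop_key <- function(x, key) {
  x[key] <- NULL
  x
}

# Rebuild each list column in one pass instead of modifying df element by element.
list_cols <- names(df)[vapply(df, is.list, logical(1))]
for (col in list_cols) {
  df[[col]] <- lapply(df[[col]], drop_key, key = "code")
}
```

This still leaves attributes.noises as a list column, so it only covers the clean-up step, not the reshaping.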
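And then after that...? I know that with tidyr::separate I could probably come up with a hacky way to spread the noise values into columns and set flags. But that only works for one attribute at a time, and I potentially have hundreds. I don't know all the possible attribute values in advance.
How can I efficiently produce the desired data.frame? Thanks for your time!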
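To make the single-attribute idea concrete, here is a rough sketch of the kind of thing I mean. It uses base R's table() rather than tidyr (my substitution, to keep it short), and it hard-codes attributes.noises and the "noise" key, which is exactly the part that does not scale to hundreds of attributes.

```r
library(jsonlite)

# Same three-record sample as above, repeated so the snippet runs on its own.
pets_json <- paste0(
  '[{"animal":"cat","attributes":{"intelligence":"medium","noises":[{"noise":"meow","code":4},{"noise":"hiss","code":2}]}},',
  '{"animal":"dog","attributes":{"intelligence":"high","noises":{"noise":"bark","code":1}}},',
  '{"animal":"snake","attributes":{"intelligence":"low","noises":{"noise":"hiss","code":2}}}]'
)
df <- fromJSON(txt = pets_json, flatten = TRUE)

# [["noise"]] works for both shapes (data.frame and plain list),
# and ignoring "code" here means it never has to be deleted.
noises <- lapply(df$attributes.noises, function(x) x[["noise"]])

# Long format: one entry per (record, noise) pair.
record <- factor(rep(seq_along(noises), lengths(noises)),
                 levels = seq_along(noises))  # keeps records with no noises
noise  <- unlist(noises)

# Cross-tabulate, then clamp the counts to 0/1 flags.
flags <- as.data.frame.matrix(table(record, noise))
flags[] <- lapply(flags, function(col) as.integer(col > 0))
names(flags) <- paste0("noises.", names(flags))

result <- data.frame(
  animal       = df$animal,
  intelligence = df$attributes.intelligence,
  flags
)
print(result)
```

The flag columns come out in alphabetical order because table() sorts the distinct values, so they do not need to be known in advance for this one attribute.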