I have a messy file that I am trying to parse into numeric data in R. The data is contained in a file that is not XML but follows a specific format:

"{"metrics":{"skin_temp":{"min":81.5,"max":96.8,"sum":93480.6,
  "summary":{"max_skin_temp_per_minute":null,"min_skin_temp_per_minute":null},
  "values":[93.2,93.2,93.3,93.3]],"stdev":0.9,"avg":2.1},
  "gsr":{"min":0.000149,"max":31.5,"sum":10300.0,
  "summary":{"max_gsr_per_minute":null,"min_gsr_per_minute":null},
  "values":[1.22,1.23,1.2,1.2],"stdev":9.630000000000001,"avg":10.1},
  "steps":{"min":0,"max":104,"sum":4202,
  "summary":{"max_steps_per_minute":null,"min_steps_per_minute":null},
  "values":[0,0,0,0]],"stdev":13.8,"avg":4}}"

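All I am interested in is the block that follows the "values" tag (the rest of this information is provided by the website I pull the data from, and I can easily compute the summary statistics in R if I need them).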

I know there must be an easier way, but so far my code looks like this:

raw_data      <- gsub('\\"', '', raw_data)
analysis_data <- c()
positioner    <- 0

for (x in 1:3) {
  # find where the data starts (and add 8 more for the 'values' text)
  data_start    <- regexpr("values:[", substring(raw_data, positioner), 
                           fixed=TRUE)[[1]] + 8 + positioner    
  data_end      <- regexpr("]", substring(raw_data, data_start), 
                           fixed=TRUE)[[1]] + data_start - 2
  data_col      <- as.numeric(strsplit(substring(raw_data, data_start, 
                              data_end), ", ")[[1]])
  analysis_data <- cbind(analysis_data, data_col)
  positioner    <- positioner + data_end
} 

Sometimes this works, but sometimes the positioner variable gets thrown off. Is there an easier way to extract these values?

1 Answer

The raw data format you are looking at is called JSON (see "What is JSON?").

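However, as @user1609452 pointed out in the comments, it is malformed. If what is posted in the OP is representative of the actual raw data being used, then it merely has some misplaced double square brackets and is missing one closing curly brace. Both are easy to fix.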
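Before patching anything, a rough count of opening and closing delimiters can confirm that diagnosis. The following is only a sketch in base R: `count_char` is a hypothetical helper, `sample_json` is a shortened stand-in for the posted data with the same two defects, and the count ignores the possibility of braces or brackets inside string values.

```r
# Hypothetical helper: count literal occurrences of `ch` in the string `s`.
count_char <- function(s, ch) {
  hits <- gregexpr(ch, s, fixed = TRUE)[[1]]
  if (hits[1] == -1L) 0L else length(hits)
}

# Shortened sample (an assumption, not the full data) with the same defects:
# one doubled "]]" and one missing closing "}".
sample_json <- '{"metrics":{"steps":{"min":0,"values":[0,0,0,0]],"avg":4}}'

# Positive gap: more "{" than "}", so closing braces are missing.
brace_gap <- count_char(sample_json, "{") - count_char(sample_json, "}")

# Negative gap: more "]" than "[", so there are stray closing brackets.
bracket_gap <- count_char(sample_json, "[") - count_char(sample_json, "]")

print(c(brace_gap = brace_gap, bracket_gap = bracket_gap))
```

A non-zero gap only says that something is unbalanced, not where, so it is a sanity check rather than a validator.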

Fixing the JSON

# store the JSON as a single string
raw_data <- '{"metrics":{"skin_temp":{"min":81.5,"max":96.8,"sum":93480.6, "summary":{"max_skin_temp_per_minute":null,"min_skin_temp_per_minute":null}, "values":[93.2,93.2,93.3,93.3]],"stdev":0.9,"avg":2.1}, "gsr":{"min":0.000149,"max":31.5,"sum":10300.0, "summary":{"max_gsr_per_minute":null,"min_gsr_per_minute":null}, "values":[1.22,1.23,1.2,1.2],"stdev":9.630000000000001,"avg":10.1}, "steps":{"min":0,"max":104,"sum":4202, "summary":{"max_steps_per_minute":null,"min_steps_per_minute":null}, "values":[0,0,0,0]],"stdev":13.8,"avg":4}}'


## Clean up the JSON
# collapse each misplaced "]]" into a single "]"
raw_data <- gsub("]]", "]", raw_data, fixed = TRUE)
# append the missing closing curly brace
raw_data <- paste0(raw_data, "}")

Once your JSON is nice and clean, it is easy to parse:

library(rjson)
dat <- fromJSON(raw_data)
lapply(dat[["metrics"]], function(D) if ("values" %in% names(D)) D$values else NA)

# or more succinctly: 
lapply(dat[["metrics"]], `[[`, "values")

Results:

$skin_temp
[1] 93.2 93.2 93.3 93.3

$gsr
[1] 1.22 1.23 1.20 1.20

$steps
[1] 0 0 0 0
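
The question was building a column-per-metric table with `cbind`, so the list above may still need one more step. A minimal sketch, assuming every metric has the same number of values: the `values` list below is the printed result typed out by hand so the snippet runs without rjson, and `analysis_df` is an illustrative name.

```r
# The parsed result shown above, written out literally (no rjson needed here).
values <- list(
  skin_temp = c(93.2, 93.2, 93.3, 93.3),
  gsr       = c(1.22, 1.23, 1.2, 1.2),
  steps     = c(0, 0, 0, 0)
)

# Bind the equal-length vectors column-wise; list names become column names.
analysis_data <- do.call(cbind, values)

# Alternatively, build a data frame with one column per metric.
analysis_df <- as.data.frame(values)

print(analysis_data)
```

If the metrics can differ in length, `cbind` will recycle or warn and `as.data.frame` will fail, so the lengths are worth checking first.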
answered 2013-08-03T14:37:24.980