r - Convert a list of characters to a data frame

Question

I have some data in JSON format that I am trying to use in R. My problem is that I can't get the data into the right format.

require(RJSONIO)

json <- "[{\"ID\":\"id1\",\"VALUE\":\"15\"},{\"ID\":\"id2\",\"VALUE\":\"10\"}]"
example <- fromJSON(json)

example <- do.call(rbind,example)
example <- as.data.frame(example,stringsAsFactors=FALSE)

> example
   ID VALUE
1 id1    15
2 id2    10

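This is close, but I can't get the numeric column converted to numeric. I know I can convert columns manually, but I thought data.frame or as.data.frame scanned the data and picked the most appropriate class for each column. Apparently I misunderstood. I am reading in many tables, all very different, and I need numeric data to be treated as numeric.

Ultimately, I want to end up with data tables that have numeric columns wherever the data are numeric.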

score 4 · Accepted Answer

read.table uses type.convert to convert data to the appropriate types. You can do the same thing as a cleanup step after reading in the JSON data.

sapply(example, class)
#          ID       VALUE 
# "character" "character" 
example[] <- lapply(example, type.convert, as.is = TRUE)
sapply(example, class)
#          ID       VALUE 
# "character"   "integer" 
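
To illustrate how type.convert decides on a class per column, here is a minimal sketch in base R. The data frame `scores` and its `SCORE` column are made up for illustration and are not part of the question's data.

```r
# Hypothetical all-character data frame, as you would get after
# do.call(rbind, ...) followed by as.data.frame(..., stringsAsFactors = FALSE).
scores <- data.frame(
  ID    = c("id1", "id2"),
  VALUE = c("15", "10"),
  SCORE = c("1.5", "2"),
  stringsAsFactors = FALSE
)

# Convert each column independently; as.is = TRUE keeps non-numeric
# columns as character instead of turning them into factors.
scores[] <- lapply(scores, type.convert, as.is = TRUE)

sapply(scores, class)
#          ID       VALUE       SCORE 
# "character"   "integer"   "numeric" 
```

Assigning into `scores[]` rather than `scores` keeps the object a data frame instead of replacing it with a plain list.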

score 1 · Accepted Answer

I would suggest you use the jsonlite package, which converts it to a data frame by default.

jsonlite::fromJSON(json)

   ID VALUE
1 id1    15
2 id2    10

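Note: the numeric problem still persists, because this JSON does not encode the data type (the values are quoted strings). So you will still have to convert the numeric columns manually.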
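
As a rough sketch of the note above, the following compares quoted and unquoted values and then applies type.convert as the manual step. It assumes the jsonlite package is installed; the names `json_quoted`, `json_bare`, `quoted`, `bare`, and `converted` are illustrative only.

```r
# Same records, once with VALUE as quoted strings and once as bare numbers.
json_quoted <- '[{"ID":"id1","VALUE":"15"},{"ID":"id2","VALUE":"10"}]'
json_bare   <- '[{"ID":"id1","VALUE":15},{"ID":"id2","VALUE":10}]'

# Call jsonlite explicitly so RJSONIO::fromJSON does not mask it.
quoted <- jsonlite::fromJSON(json_quoted)
bare   <- jsonlite::fromJSON(json_bare)

is.numeric(quoted$VALUE)  # FALSE: quoted values arrive as character
is.numeric(bare$VALUE)    # TRUE: unquoted values arrive as numbers

# Manual cleanup for the quoted case, column by column.
converted <- quoted
converted[] <- lapply(converted, type.convert, as.is = TRUE)
is.numeric(converted$VALUE)  # TRUE
```

If you control the JSON producer, emitting unquoted numbers avoids the cleanup step entirely.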

score 0 · Accepted Answer

To follow up on Ramnath's suggestion to switch to jsonlite, I benchmarked the two approaches:

##RJSONIO vs. jsonlite for a simple example

require(RJSONIO)
require(jsonlite)
require(microbenchmark)

json <- "{\"ID\":\"id1\",\"VALUE\":\"15\"},{\"ID\":\"id2\",\"VALUE\":\"10\"}"
test <- rep(json,1000)
test <- paste(test,collapse=",")
test <- paste0("[",test,"]")

func1 <- function(x){
  temp <- jsonlite::fromJSON(x)
}

func2 <- function(x){
  temp <- RJSONIO::fromJSON(x)
  temp <- do.call(rbind,temp)
  temp <- as.data.frame(temp,stringsAsFactors=FALSE)
}

> microbenchmark(func1(test),func2(test))
Unit: milliseconds
       expr       min        lq    median        uq       max neval
func1(test) 204.05228 221.46047 233.93321 246.90815 341.95684   100
func2(test)  21.60289  22.36368  22.70935  23.75409  27.41851   100

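I know the jsonlite package is still new and focuses on accuracy over performance. For now at least, the older RJSONIO runs faster on this simple example (roughly ten times faster by median), even with the extra step of converting the list to a data frame.

Update to include rjson: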

require(rjson)

func3 <- function(x){
  temp <- rjson::fromJSON(x)
  temp <- do.call(rbind,lapply(temp,unlist))
  temp <- as.data.frame(temp,stringsAsFactors=FALSE)
}

> microbenchmark(func1(test),func2(test),func3(test))
Unit: milliseconds
       expr       min        lq    median        uq       max neval
func1(test) 205.34603 220.85428 234.79492 249.87628 323.96853   100
func2(test)  21.76972  22.67311  23.11287  23.56642  32.97469   100
func3(test)  14.16942  15.96937  17.29122  20.19562  35.63004   100

> microbenchmark(func1(test),func2(test),func3(test),times=500)
Unit: milliseconds
       expr       min        lq    median        uq       max neval
func1(test) 206.48986 225.70693 241.16301 253.83269 336.88535   500
func2(test)  21.75367  22.53256  23.06782  23.93026 103.70623   500
func3(test)  14.21577  15.61421  16.86046  19.27347  95.13606   500

> identical(func1(test),func2(test)) & identical(func1(test),func3(test))
[1] TRUE

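At least on my machine, rjson is only slightly faster than RJSONIO. I did not test how it scales, though, and scaling may be where it gets the large performance gain over RJSONIO that Ramnath suggested.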
