r - Reading a .jsonl file into R is very slow

Question

I need to read a .jsonl file into R, and it is very slow. For a file with 67,000 lines, loading takes more than 10 minutes. Here is my code:

library(dplyr)
library(tidyr)
library(rjson)

f<-data.frame(Reduce(rbind, lapply(readLines("filename.jsonl"),fromJSON)))
f2<-f%>%
  unnest(cols = names(f))

Here is a sample of the .jsonl file:

{"UID": "a1", "str1": "Who should win?", "str2": "Who should we win?", "length1": 3, "length2": 4, "prob1": -110.5, "prob2": -108.7}
{"UID": "a2", "str1": "What had she walked through?", "str2": "What had it walked through?", "length1": 5, "length2": 5, "prob1": -154.6, "prob2": -154.8}

So my questions are: (1) why does this take so long to run, and (2) how can I fix it?

score 3 · Accepted Answer

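I think the most efficient way to read JSON Lines files is the stream_in() function from the jsonlite package. stream_in() needs a connection as input, but you can use the following function to read a plain text file: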

read_json_lines <- function(file){
  con <- file(file, open = "r")
  on.exit(close(con))
  jsonlite::stream_in(con, verbose = FALSE)
}
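
As a quick check, here is a minimal sketch that runs the helper on the two sample records from the question. The temporary file is only a stand-in for filename.jsonl, and the helper is repeated so the snippet runs on its own. stream_in() returns a data frame directly, so the Reduce(rbind, ...) and unnest() steps are not needed. Those steps are also the likely reason the original code is slow, since Reduce(rbind, ...) grows the result one row at a time and copies it on every step.

```r
library(jsonlite)

# Helper from above, repeated so this snippet is self-contained
read_json_lines <- function(file) {
  con <- file(file, open = "r")
  on.exit(close(con))
  jsonlite::stream_in(con, verbose = FALSE)
}

# The two sample records from the question, written to a temporary file
# (assumption: the temp file stands in for "filename.jsonl")
sample_lines <- c(
  '{"UID": "a1", "str1": "Who should win?", "str2": "Who should we win?", "length1": 3, "length2": 4, "prob1": -110.5, "prob2": -108.7}',
  '{"UID": "a2", "str1": "What had she walked through?", "str2": "What had it walked through?", "length1": 5, "length2": 5, "prob1": -154.6, "prob2": -154.8}'
)
tmp <- tempfile(fileext = ".jsonl")
writeLines(sample_lines, tmp)

# One row per JSON line, one column per key
f2 <- read_json_lines(tmp)
str(f2)
```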

score 0 · Accepted Answer

You could also have a look at ndjson. It is a wrapper around Niels Lohmann's super convenient C++ json library. The interface is similar to jsonlite:

df <- ndjson::stream_in('huge_file.jsonl')

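Alternatively, you can parallelize it. Of course, this depends on your specific setup (e.g., CPU, HDD, file), but you can give it a try. I often work with BigQuery dumps. For larger tables, the output is split across files. This makes it possible to parallelize at the file level (read and parse several files in parallel, then merge the outputs):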

library(furrr)

# my machine has more than 30 cores and a quite fast SSD,
# so running 20 workers in parallel is no problem
plan(multisession, workers = 20)

df <- future_map_dfr(
   # this returns a character vector with the paths of all my JSON Lines files
   # (note: pattern is a regular expression, not a glob)
   list.files(path = "../data/panel", pattern = "00*", full.names = TRUE),
   # each file is parsed separately
   function(f) jsonlite::stream_in(file(f))
)
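
Before scaling out, it can help to try the per-file reader sequentially on a small input. The sketch below does the same split-and-merge with lapply() and rbind() and needs only jsonlite. The directory, file names, and records are hypothetical stand-ins for a sharded dump. rbind() assumes every file yields the same columns.

```r
library(jsonlite)

# Hypothetical stand-in for a sharded dump: two small files in a temp directory
shard_dir <- tempfile("panel")
dir.create(shard_dir)
writeLines('{"UID": "a1", "length1": 3}',
           file.path(shard_dir, "000000000000.jsonl"))
writeLines(c('{"UID": "a2", "length1": 5}', '{"UID": "a3", "length1": 4}'),
           file.path(shard_dir, "000000000001.jsonl"))

# list.files() returns the paths sorted alphabetically
files <- list.files(shard_dir, pattern = "\\.jsonl$", full.names = TRUE)

# Parse each file on its own, then merge the per-file data frames
parts <- lapply(files, function(f) jsonlite::stream_in(file(f), verbose = FALSE))
df <- do.call(rbind, parts)
```

If this works on a few files, swapping lapply() plus rbind() for future_map_dfr() as shown above spreads the same per-file work across workers.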
