json - Cannot use jsonlite::stream_in with certain JSON formats

Question

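I am trying to stream a fairly large (65 GB) JSON file from the YASP data dump (https://github.com/yasp-dota/yasp/wiki/JSON-Data-Dump), but the way the file is formatted seems to prevent me from reading it, and I get this error:

Error: parse error: premature EOF [ (right here) ------^

I created this small sample JSON file in the same format so that others can easily reproduce the problem: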

[
{"match_id": 2000594819,"match_seq_num": 1764515493}
,
{"match_id": 2000594820,"match_seq_num": 1764515494}
,
{"match_id": 2000594821,"match_seq_num": 1764515495}
]

I saved this file as test.json and tried to load it with the jsonlite::stream_in function:

library(jsonlite)
con <- file('~/yasp/test.json')
jsonStream <- stream_in(con)

I get the same "premature EOF" error as shown above.

However, if the whole file is formatted as a single chunk on one line, like this:

[{"match_id": 2000594819,"match_seq_num": 1764515493},{"match_id": 2000594820,"match_seq_num": 1764515494},{"match_id": 2000594821,"match_seq_num": 1764515495}]

then there is no problem and stream_in works fine.

I have played around with readLines, collapsing the lines before parsing:

initialJSON <- readLines('~/yasp/test.json')
collapsedJSON <- paste(initialJSON, collapse="")

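While this does work and produces a string I can parse as JSON, it is not a scalable solution for me, since I can only read a few thousand lines at a time this way (I would also like to be able to stream directly from the gz file).

Does anyone know how I can get stream_in to accept this file format, or another way to do this in R? They show an example of it working fine in Java, but I would like to be able to do this without jumping into a language I don't really know.

Update

I still haven't got stream_in working, but I wrote my own streamer (of sorts), which seems to perform decently for my purposes.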

fileCon <- file('~/yasp/test.json', open="r")

# Initialize everything
numMatches <- 5
outputFile <- 0
lineCount <- 0
matchCount <- 0
matchIDList <- NULL

# Stream using readLines and only look at even numbered lines
# Stream using readLines and only look at even numbered lines
while(matchCount < numMatches) {
    next_line <- readLines(fileCon, n = 1)

    # Stop at end of file (readLines returns character(0))
    if(length(next_line) == 0) break

    lineCount <- lineCount + 1

    if(lineCount %% 2 == 0) {

        # read into JSON format
        readJSON <- jsonlite::fromJSON(next_line)

        # Up the match counter
        matchCount <- matchCount + 1

        # whatever operations you want, for example get match_id
        matchIDList <- c(matchIDList, readJSON$match_id)
    }

}

close(fileCon)

score 0 · Accepted Answer

Well, I never got the stream_in function to work for me, but I created my own streamer, which works well and has a small memory footprint.

streamJSON <- function(con, pagesize, numMatches){
  library(jsonlite)
  library(data.table)
  library(plyr)
  library(dplyr)
  library(tidyr)

  ## "con" is the file connection
  ## "pagesize" is number of lines streamed for each iteration.
  ## "numMatches" is number of games we want to output

  outputFile <- 0
  matchCount <- 0
  print("Starting parsing games...")
  print(paste("Number of games parsed:",matchCount))
  # Stream in using readLines until we reach the number of matches we want.
  while(matchCount < numMatches) {

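    initialJSON <- readLines(con, n = pagesize)
    if (length(initialJSON) < 2) break  # end of file reached

    ## Drop the first line of each page ("[" or ","). With an even pagesize the
    ## remaining lines alternate record, ",", record, ... and end on a record.
    collapsedJSON <- paste(initialJSON[-1], collapse="")
    ## Strip the closing "]" that appears on the final page, then re-wrap
    fixedJSON <- sprintf("[%s]", sub("\\]$", "", collapsedJSON))
    readJSON <- jsonlite::fromJSON(fixedJSON)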

    finalList <- 0
    ## Run through the initial file
    for (i in 1:length(readJSON$match_id)) {
      ## Some work with the JSON to return whatever it is i wanted to return
      ## In this example match_id, who won, and the duration.

      matchList <- as.data.frame(cbind(readJSON$match_id[[i]],
                                    readJSON$radiant_win[[i]],
                                    readJSON$duration[[i]]))
      colnames(matchList) <- c("match_id", "radiant_win", "duration")

      ## Assign to output
      if (length(finalList) == 1) {
        finalList <- matchList
      } else {
        finalList <- rbind.fill(finalList, matchList)
      } 
    }

    matchCount <- matchCount + length(unique(finalList[,1]))

    if (length(outputFile) == 1) {
       outputFile <- finalList
    } else {
      outputFile <- rbind.fill(outputFile, finalList)
    } 
    print(paste("Number of games parsed:",matchCount))
  }
  return(outputFile)
}

Not sure whether this will help anyone else, since it is probably somewhat specific to the YASP data dump, but I can now call the function like this:

fileCon <- gzfile('~/yasp/yasp-dump-2015-12-18.json.gz', open="rb")
streamJSON(fileCon, 100, 500)

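This outputs a data frame of 500 rows containing the specified data. I then have to modify the part inside the while loop according to whatever I want to extract from the JSON data.

I have been able to stream 50,000 matches quite easily (with fairly complex JSON functions), and it seems to run in a time per match comparable to the stream_in function.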
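For anyone adapting this, here is a minimal sketch of the same read-a-page, re-wrap, parse idea that picks out separator lines by their content rather than by their position, so the page size no longer has to be even. It assumes, as in the sample file above, that every record sits on exactly one line and that the only other lines are "[", "," and "]". The names read_array_pages and handler are hypothetical and not part of jsonlite.

```r
library(jsonlite)

# Hypothetical helper (not part of jsonlite): parse a dump laid out as
# "[", record, ",", record, ..., "]" with one element per line.
read_array_pages <- function(con, handler, pagesize = 1000) {
  # Open the connection if needed, and close it again only if we opened it
  if (!isOpen(con)) {
    open(con, "r")
    on.exit(close(con))
  }
  total <- 0
  repeat {
    lines <- trimws(readLines(con, n = pagesize, warn = FALSE))
    if (length(lines) == 0) break  # end of file
    # Keep only record lines; separators are recognised by content
    records <- lines[!lines %in% c("[", ",", "]", "")]
    if (length(records) == 0) next
    # Re-wrap the page as one valid JSON array and parse it
    page <- fromJSON(paste0("[", paste(records, collapse = ","), "]"))
    handler(page)
    total <- total + length(records)
  }
  invisible(total)
}

# Usage on a copy of the small sample file from the question
path <- tempfile(fileext = ".json")
writeLines(c("[",
             '{"match_id": 2000594819,"match_seq_num": 1764515493}', ",",
             '{"match_id": 2000594820,"match_seq_num": 1764515494}', ",",
             '{"match_id": 2000594821,"match_seq_num": 1764515495}', "]"),
           path)

pages <- list()
n_read <- read_array_pages(file(path), function(df) {
  pages[[length(pages) + 1]] <<- df  # collect each parsed page
}, pagesize = 3)
matches <- do.call(rbind, pages)
```

The handler is called once per page, so anything extracted there (for example match_id, radiant_win and duration) can be reduced before the next page is read. A connection that is already open, such as gzfile(..., open="rb"), is left open for the caller to close.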
