
The code below should:

  • iterate over a sequence of strings
  • parse each one as JSON
  • filter out fields whose names could not be used as identifiers in most languages
  • lowercase the remaining names
  • serialize the result as a string

It behaves as expected on small tests, but on an 8.6M item sequence of live data the output sequence is significantly longer than the input sequence:

import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark._

val txt = sc.textFile("s3n://...")
val patt="""^[a-zA-Z]\w*$""".r.findFirstIn _
val json = (for {
         line <- txt
         JObject(children) <- parse(line)
         children2 = (for {
           JField(name, value) <- children

           // filter fields with invalid names
           // patt(name) returns Option[String]
           _ <- patt(name)

         } yield JField(name.toLowerCase, value))
} yield compact(render(JObject(children2))))

I have checked that it actually increases the number of unique items, so it is not just duplicating items. Given my understanding of Scala for-comprehensions and json4s, I do not see how this is possible. The large live data collection is a Spark RDD, while my tests used an ordinary Scala Seq, but that should not make any difference.

How can json have more elements than txt in the above code?


2 Answers


Maybe parse(line) returns more than one JSON object for a single line?

answered 2014-10-27T12:23:08.513

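I was not aware that

JObject(children) <- parse(line)

matches recursively within the result of parse. So even when parse returns a single value, any nested objects are matched as well and come back as separate bindings for children. The answer is to use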

JObject(children) = parse(line)
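To see the difference in isolation, here is a minimal sketch outside Spark. The object name RecursiveMatchDemo and the sample line are made up for illustration; the sketch assumes json4s with the Jackson backend on the classpath. With one nested object in the input, the generator form should produce two bindings (the outer object and the nested one), while the pattern-binding form only looks at the top-level value.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._

object RecursiveMatchDemo {
  // Illustrative input: a top-level object that contains one nested object.
  val line = """{"a": 1, "b": {"c": 2}}"""

  // Generator form: json4s lets a JValue be used as a generator and
  // walks every node of the tree, keeping each one that matches the
  // pattern. Nested objects therefore produce extra elements.
  val viaGenerator: List[List[JField]] =
    for { JObject(children) <- parse(line) } yield children

  // Pattern-binding form: a plain pattern match against the single
  // value returned by parse, so only the top-level object is bound.
  val JObject(topLevel) = parse(line)

  def main(args: Array[String]): Unit = {
    println(s"objects matched by the generator: ${viaGenerator.size}")
    println(s"fields in the top-level object:   ${topLevel.size}")
  }
}
```

This would also explain why small tests looked fine: if the test inputs were flat objects, there was only one object per line to match.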

The correct code is:

import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark._

val txt = sc.textFile("s3n://...")
val patt="""^[a-zA-Z]\w*$""".r.findFirstIn _
val json = (for {
         line <- txt
         JObject(children) = parse(line) // CHANGED <- TO =
         children2 = (for {
           JField(name, value) <- children

           // filter fields with invalid names
           // patt(name) returns Option[String]
           _ <- patt(name)

         } yield JField(name.toLowerCase, value))
} yield compact(render(JObject(children2))))
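
One caveat, as far as I understand the desugaring: with =, the pattern is an ordinary pattern binding, so a line whose top-level value is not an object (an array, say) would throw a MatchError instead of being skipped. If that can happen in the data, a sketch like the following keeps the top-level-only behaviour and drops such lines. The helper topLevelFields and the sample lines are hypothetical, and a plain Seq stands in for the RDD.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._

object TopLevelOnly {
  // Hypothetical helper: Some(fields) when the line parses to a JSON
  // object at the top level, None for any other kind of JSON value.
  def topLevelFields(line: String): Option[List[JField]] =
    parse(line) match {
      case JObject(children) => Some(children)
      case _                 => None
    }

  // Illustrative input: one object with a nested object, one array.
  val lines = Seq("""{"A": 1, "b": {"c": 2}}""", """[1, 2, 3]""")

  // The Option generator contributes zero or one element per line,
  // so the output can never be longer than the input.
  val json: Seq[String] = for {
    line     <- lines
    children <- topLevelFields(line)
  } yield compact(render(JObject(children)))
}
```

Lines that are not valid JSON at all still make parse throw, so this only covers the non-object case.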
answered 2014-10-27T22:10:44.523