The code below should:
- iterate over a sequence of strings,
- parse each one as JSON,
- filter out fields whose names could not be used as an identifier in most languages,
- lowercase the remaining names, and
- serialize the result back to a string.
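For example, here is a hypothetical input line (the field names are made up for illustration) and the output I expect for it:

    input:  {"Name": "x", "1bad": "y", "Good_1": "z"}
    output: {"name":"x","good_1":"z"}

"1bad" is dropped because it starts with a digit; the surviving names are lowercased.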
The code behaves as expected on small tests, but on an 8.6M-item sequence of live data the output sequence is significantly longer than the input sequence:
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark._

val txt = sc.textFile("s3n://...")
val patt = """^[a-zA-Z]\w*$""".r.findFirstIn _
val json = (for {
  line <- txt
  JObject(children) <- parse(line)
  children2 = (for {
    JField(name, value) <- children
    // filter fields with invalid names
    // patt(name) returns Option[String]
    _ <- patt(name)
  } yield JField(name.toLowerCase, value))
} yield compact(render(JObject(children2))))
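For reference, patt accepts exactly the names that look like simple identifiers (an ASCII letter followed by word characters); a few illustrative calls:

    patt("Name")    // Some("Name")
    patt("1bad")    // None  (starts with a digit)
    patt("foo-bar") // None  ('-' is not a word character)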
I have checked that it actually increases the number of unique items, so it is not simply duplicating inputs. Given my understanding of Scala for-comprehensions and json4s, I do not see how this is possible. The large live collection is a Spark RDD, while my small tests ran on an ordinary Scala Seq, but that should not make any difference.
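The size checks looked roughly like this (a sketch of what I ran; the counts shown are the observations described above, not new measurements):

    txt.count()              // ~8.6M input lines
    json.count()             // significantly larger
    json.distinct().count()  // larger than txt.distinct().count(), so not mere duplicates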
How can json have more elements than txt in the above code?