If you are using Spark 1.2, you can do the following (in Scala)...
If you already know which fields you want, you can build a schema for those fields and apply it to the JSON dataset. Spark SQL will return a SchemaRDD, which you can then register as a table and query. Here is a snippet...
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// The schema is encoded in a string
val schemaString = "name gender"
// Import Spark SQL data types (StructType, StructField, StringType).
import org.apache.spark.sql._
// Generate the schema from the schema string: one nullable string field per name.
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Create the SchemaRDD for your JSON file "people.txt" (every line of this file is a JSON object),
// applying the schema defined above.
val peopleSchemaRDD = sqlContext.jsonFile("people.txt", schema)
// Check the schema of peopleSchemaRDD
peopleSchemaRDD.printSchema()
// Register peopleSchemaRDD as a table called "people"
peopleSchemaRDD.registerTempTable("people")
// Because the applied schema only contains name and gender, only values of
// those two fields will be in the results, even with SELECT *.
val results = sqlContext.sql("SELECT * FROM people")
When you look at the schema of peopleSchemaRDD (peopleSchemaRDD.printSchema()), you will only see the name and gender fields.
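For reference, the printSchema() output for the schema built above should look roughly like this (both fields were declared as nullable strings):

```
root
 |-- name: string (nullable = true)
 |-- gender: string (nullable = true)
```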
Alternatively, if you want to explore the dataset first and decide which fields you need after seeing all of them, you can let Spark SQL infer the schema for you. You can then register the SchemaRDD as a table and use a projection to drop the unneeded fields. Here is a snippet...
// Spark SQL will infer the schema of the given JSON file.
val peopleSchemaRDD = sqlContext.jsonFile("people.txt")
// Check the schema of peopleSchemaRDD
peopleSchemaRDD.printSchema()
// Register peopleSchemaRDD as a table called "people"
peopleSchemaRDD.registerTempTable("people")
// Project only the name and gender fields.
val results = sqlContext.sql("SELECT name, gender FROM people")
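As in the first approach, the query returns a SchemaRDD; to actually look at the projected rows you can collect it on the driver. A minimal sketch, assuming the result set is small enough to fit in driver memory:

```scala
// Materialize the projected rows on the driver and print each one.
// Note: collect() pulls all rows to the driver, so only use it on small results.
val projected = sqlContext.sql("SELECT name, gender FROM people")
projected.collect().foreach(println)
```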