Does Spark SQL provide any way to automatically load CSV data? I found the following Jira: https://issues.apache.org/jira/browse/SPARK-2360, but it was closed...
Currently I load a CSV file as follows:
case class Record(id: String, val1: String, val2: String, ....)

// Needed in Spark 1.x for the implicit RDD -> SchemaRDD conversion
// that makes registerAsTable available:
import sqlContext.createSchemaRDD

sc.textFile("Data.csv")
  .map(_.split(","))
  .map { r =>
    Record(r(0), r(1), .....)
  }.registerAsTable("table1")
Any hints on inferring the schema automatically from a CSV file? In particular, (a) how can I generate a class representing the schema, and (b) how can I populate it automatically (i.e. Record(r(0), r(1), .....))?
Update: I found a partial answer for the schema generation here: http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#data-sources
// Row, StructType, StructField, StringType (Spark 1.1):
import org.apache.spark.sql._

// The schema is encoded in a string.
val schemaString = "name age"

// Generate the schema based on the string of schema.
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Apply the schema to the RDD.
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
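For question (a), I assume the hard-coded schemaString could be derived from the CSV file itself instead; a minimal, untested sketch, assuming Data.csv has a comma-separated header as its first line:

// Build the StructType from the header line of the CSV file
// (assumes the first line contains the column names).
val header = sc.textFile("Data.csv").first()
val schema =
  StructType(
    header.split(",").map(fieldName => StructField(fieldName.trim, StringType, true)))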
So the only remaining question is how to perform the step
map(p => Row(p(0), p(1).trim))
dynamically for a given number of attributes?
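One idea that might work (an untested sketch, reusing the people, schema and sqlContext names from the guide snippet above): since Row(...) takes varargs, the whole split array can be passed at once with : _*, so the number of attributes no longer has to be written out by hand:

val rowRDD = people
  .map(_.split(","))
  // Pass all fields of the split line as varargs instead of listing
  // p(0), p(1), ...; each field is trimmed to mirror p(1).trim above.
  .map(p => Row(p.map(_.trim): _*))

// Apply the dynamically generated schema exactly as before.
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)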
Thanks for your support! Jörg