scala - Spark SQL: automatic schema from csv

Question

Does Spark SQL provide any way to automatically load csv data? I found the following Jira: https://issues.apache.org/jira/browse/SPARK-2360 but it was closed....

Currently I would load a csv file as follows:

case class Record(id: String, val1: String, val2: String, ....)

sc.textFile("Data.csv")
  .map(_.split(","))
  .map { r =>
    Record(r(0), r(1), .....)
  }.registerAsTable("table1")

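Any hints on how to automatically deduce the schema from csv files? In particular, a) how can I generate a class representing the schema and b) how can I automatically fill it (i.e. Record(r(0), r(1), .....))?

Update: I found a partial answer to the schema generation here: http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#data-sources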

// The schema is encoded in a string
val schemaString = "name age"
// Generate the schema based on the string of schema
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
// Apply the schema to the RDD.
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)

So the only question left is how to do the step map(p => Row(p(0), p(1).trim)) dynamically for a given number of attributes?

Thanks for your support! Joerg

score 5 · Accepted Answer


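You can use spark-csv, which saves you a few keystrokes: you don't have to define the column names, because it can pick them up from the header automatically.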
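
For illustration, here is a minimal sketch of what that could look like. It assumes Spark 1.4 or later (for the `sqlContext.read` API) and the spark-csv package on the classpath; the file name `Data.csv` and table name `table1` are taken from the question, and the object name `CsvAutoSchema` is made up for this example. With `header` set to `true` the first line supplies the column names, and `inferSchema` asks spark-csv to guess the column types instead of treating everything as a string.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical example object; assumes Spark 1.4+ with spark-csv on the classpath.
object CsvAutoSchema {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("csv-auto-schema").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Load the file through the spark-csv data source.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")      // first line holds the column names
      .option("inferSchema", "true") // guess column types from the data
      .load("Data.csv")

    // Show the schema that was derived, then query the data by column name.
    df.printSchema()
    df.registerTempTable("table1")
    sqlContext.sql("SELECT * FROM table1").show()

    sc.stop()
  }
}
```

No case class and no hand-written `Row(...)` mapping are needed in this variant, since the data source produces the schema and the rows itself.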

answered 2015-02-25T01:19:43.700

score 5 · Accepted Answer

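// Column names, one per space-separated token
val schemaString = "name age".split(" ")
// Generate the schema based on the string of schema
val schema = StructType(schemaString.map(fieldName => StructField(fieldName, StringType, true)))
// Make sure every element of the RDD is a single line
val lines = people.flatMap(x => x.split("\n"))
// Build each Row from however many fields the line contains.
// The split here is on a space; use "," for comma-separated data.
val rowRDD = lines.map(line => {
  Row.fromSeq(line.split(" "))
})
// Apply the schema to the RDD.
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)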

Maybe this link will help you.

http://devslogics.blogspot.in/2014/11/spark-sql-automatic-schema-from-csv.html
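
The same idea also works when the column names come from the file's header line rather than a hard-coded string. The sketch below is plain Scala with a `Seq[String]` standing in for the RDD of lines, so it runs without Spark; the object and method names (`HeaderSchema`, `fieldNames`, `parseLine`, `records`) are made up for this example. It assumes a simple file with no quoted or escaped delimiters.

```scala
// Hypothetical helper, not part of Spark: derives column names from the header
// and splits every data line into the same number of fields.
object HeaderSchema {
  // Column names taken from the header line.
  // Note: String.split treats the delimiter as a regular expression.
  def fieldNames(header: String, delimiter: String = ","): Seq[String] =
    header.split(delimiter, -1).map(_.trim).toSeq

  // Split one data line into trimmed fields; limit -1 keeps trailing empty fields.
  def parseLine(line: String, delimiter: String = ","): Seq[String] =
    line.split(delimiter, -1).map(_.trim).toSeq

  // Pair each data line with the header names, for any number of columns.
  // Lines whose field count does not match the header are skipped.
  def records(lines: Seq[String]): Seq[Map[String, String]] = {
    val names = fieldNames(lines.head)
    lines.tail
      .map(line => parseLine(line))
      .filter(fields => fields.length == names.length)
      .map(fields => names.zip(fields).toMap)
  }
}
```

In a Spark job, the names from `fieldNames` could feed the `StructType(... StructField(fieldName, StringType, true))` construction shown above, and the output of `parseLine` could be passed to `Row.fromSeq`, so nothing depends on the number of columns.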
