sql - Spark SQL - How to select on dates stored as UTC millis from the epoch?

Question

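I have been searching and have not found a solution for querying dates stored as UTC milliseconds from the epoch using Spark SQL. The schema I pulled in from a NoSQL data source (JSON from MongoDB) has the target date as: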

 |-- dateCreated: struct (nullable = true)
 |    |-- $date: long (nullable = true)

The complete schema is as follows:

scala> accEvt.printSchema
root
 |-- _id: struct (nullable = true)
 |    |-- $oid: string (nullable = true)
 |-- appId: integer (nullable = true)
 |-- cId: long (nullable = true)
 |-- data: struct (nullable = true)
 |    |-- expires: struct (nullable = true)
 |    |    |-- $date: long (nullable = true)
 |    |-- metadata: struct (nullable = true)
 |    |    |-- another key: string (nullable = true)
 |    |    |-- class: string (nullable = true)
 |    |    |-- field: string (nullable = true)
 |    |    |-- flavors: string (nullable = true)
 |    |    |-- foo: string (nullable = true)
 |    |    |-- location1: string (nullable = true)
 |    |    |-- location2: string (nullable = true)
 |    |    |-- test: string (nullable = true)
 |    |    |-- testKey: string (nullable = true)
 |    |    |-- testKey2: string (nullable = true)
 |-- dateCreated: struct (nullable = true)
 |    |-- $date: long (nullable = true)
 |-- id: integer (nullable = true)
 |-- originationDate: struct (nullable = true)
 |    |-- $date: long (nullable = true)
 |-- processedDate: struct (nullable = true)
 |    |-- $date: long (nullable = true)
 |-- receivedDate: struct (nullable = true)
 |    |-- $date: long (nullable = true)

My goal is to write queries along the lines of:

SELECT COUNT(*) FROM myTable WHERE dateCreated BETWEEN [dateStoredAsLong0] AND [dateStoredAsLong1]

My process so far has been:

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@29200d25

scala> val accEvt = sqlContext.jsonFile("/home/bkarels/mongoexport/accomplishment_event.json")

...
14/10/29 15:03:38 INFO SparkContext: Job finished: reduce at JsonRDD.scala:46, took 4.668981083 s
accEvt: org.apache.spark.sql.SchemaRDD = 
SchemaRDD[6] at RDD at SchemaRDD.scala:103

scala> accEvt.registerAsTable("accomplishmentEvent")

(At this point the following baseline query executes successfully.)

scala> sqlContext.sql("select count(*) from accomplishmentEvent").collect.foreach(println)
...
[74475]

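Now, the voodoo I cannot get right is how to form my select statement to reason about the dates. For example, the following executes without error, but returns zero rather than the count of all records (74475).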

scala> sqlContext.sql("select count(*) from accomplishmentEvent where processedDate >= '1970-01-01'").collect.foreach(println)
...
[0]

I have also tried some ugliness like:

scala> val now = new java.util.Date()
now: java.util.Date = Wed Oct 29 15:05:15 CDT 2014

scala> val today = now.getTime
today: Long = 1414613115743

scala> val thirtydaysago = today - (30 * 24 * 60 * 60 * 1000)
thirtydaysago: Long = 1416316083039


scala> sqlContext.sql("select count(*) from accomplishmentEvent where processedDate <= %s and processedDate >= %s".format(today,thirtydaysago)).collect.foreach(println)

As recommended, I selected on a named field to make sure that works. So:

scala> sqlContext.sql("select receivedDate from accomplishmentEvent limit 10").collect.foreach(println)

returns:

[[1376318850033]]
[[1376319429590]]
[[1376320804289]]
[[1376320832835]]
[[1376320832960]]
[[1376320835554]]
[[1376320914480]]
[[1376321041899]]
[[1376321109341]]
[[1376321121469]]

Then, extending that to try to get some kind of date comparison working, I tried:

scala> sqlContext.sql("select cId from accomplishmentEvent where receivedDate.date > '1970-01-01' limit 5").collect.foreach(println)

which results in the error:

java.lang.RuntimeException: No such field date in StructType(ArrayBuffer(StructField($date,LongType,true)))
...

Prefixing the field name with $, as was also suggested, results in a different kind of error:

scala> sqlContext.sql("select cId from accomplishmentEvent where receivedDate.$date > '1970-01-01' limit 5").collect.foreach(println)
java.lang.RuntimeException: [1.69] failure: ``UNION'' expected but ErrorToken(illegal character) found

select actualConsumerId from accomplishmentEvent where receivedDate.$date > '1970-01-01' limit 5

Clearly I am not getting how to select on dates stored this way. Can anyone help me fill in this gap?

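I am fairly new to both Scala and Spark, so forgive me if this is an elementary question, but my searches of the forums and the Spark documentation have turned up empty.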

Thank you.

score 1 · Accepted Answer

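Your JSON is not flat, so the fields below the top level need to be addressed using qualified names, such as dateCreated.$date. Your specific date fields are all of long type, so you will need to do numerical comparisons on them, and it looks like you were on the right track for doing those.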
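One thing to watch when computing the numeric bounds: in the question's transcript, 30 * 24 * 60 * 60 * 1000 is evaluated in Int arithmetic and overflows, which is why thirtydaysago came out later than today. Below is a minimal sketch of computing such bounds in Long arithmetic. The names EpochBounds, daysToMillis and utcDateToMillis are made up for illustration, and java.time assumes Java 8 or later.

```scala
import java.time.{LocalDate, ZoneOffset}

object EpochBounds {
  // Hypothetical helper: the 24L literal forces Long arithmetic,
  // so the product cannot overflow the Int range.
  def daysToMillis(days: Int): Long = days * 24L * 60 * 60 * 1000

  // Hypothetical helper: midnight UTC of an ISO date (yyyy-MM-dd)
  // expressed as milliseconds since the epoch.
  def utcDateToMillis(isoDate: String): Long =
    LocalDate.parse(isoDate).atStartOfDay(ZoneOffset.UTC).toInstant.toEpochMilli
}
```

With the value of today from the question, today - EpochBounds.daysToMillis(30) gives 1412021115743 rather than 1416316083039.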

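Another problem is that your field names contain "$" characters, and Spark SQL won't let you query on them. One solution is that, instead of reading the JSON directly as a SchemaRDD (as you have done), you first read it as an RDD[String], use the map method to perform the Scala string manipulations of your choice, and then use SQLContext's jsonRDD method to create the SchemaRDD.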

val lines = sc.textFile(...)
// you may want something less naive than global replacement of all "$" chars
val linesFixed = lines.map(s => s.replaceAllLiterally("$", ""))
val accEvt = sqlContext.jsonRDD(linesFixed)
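
The comment in the snippet above hints at a narrower replacement. Here is a minimal sketch, assuming the only names that need fixing are "$"-prefixed object keys such as "$date" and "$oid". The names MongoKeys and stripKeyPrefix are hypothetical, and a regex over raw JSON text is a heuristic, not a parser.

```scala
object MongoKeys {
  // Matches a quoted key that starts with '$' and is followed by a colon,
  // e.g. "$date": or "$oid":  (the key name is captured as group 1).
  private val DollarKey = "\"\\$(\\w+)\"\\s*:"

  // Hypothetical helper: drops the leading '$' from such keys and leaves
  // '$' characters elsewhere (for example inside string values) untouched.
  def stripKeyPrefix(line: String): String =
    line.replaceAll(DollarKey, "\"$1\":")
}
```

With lines as above, lines.map(MongoKeys.stripKeyPrefix) would take the place of the global replacement, and the nested field should then be addressable as receivedDate.date.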

I have tested this with Spark 1.1.0.
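
Once the "$" is gone from the field names, the range query from the question can be written as a plain numeric comparison on the nested long field. The following sketch only builds the SQL text. DateRangeQuery and countBetween are hypothetical names, and the field path processedDate.date assumes the renaming described above.

```scala
object DateRangeQuery {
  // Hypothetical helper: returns the SQL string only; it does not run anything.
  // fromMs and toMs are inclusive bounds in milliseconds since the epoch.
  def countBetween(table: String, field: String, fromMs: Long, toMs: Long): String =
    s"select count(*) from $table where $field >= $fromMs and $field <= $toMs"
}
```

The result could then be passed to sqlContext.sql, for example sqlContext.sql(DateRangeQuery.countBetween("accomplishmentEvent", "processedDate.date", from, to)), where from and to are Long values.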

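For reference, the lack of quoting support in Spark SQL has been noted in this bug report and others, and it seems the fix was checked in recently, but it will take some time to make it into a release.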
