I want to filter the rows of a DataFrame in SparkR by timestamp, with a format like the following:

df <- createDataFrame(sqlContext, data.frame(ID = c(1,2,3),
                                             Timestamp=c('08/01/2014 11:18:30',
                                                         '01/01/2015 12:13:45',
                                                         '05/01/2015 14:17:33')))

Please note that the original schema for the Timestamp column is String. Say I want to filter the timestamps before 03/01/2015 00:00:00; I think there are two possible approaches:
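
A quick sanity check with printSchema (assuming the DataFrame created above) confirms that Timestamp comes through as a plain string:

printSchema(df)
## root
##  |-- ID: double (nullable = true)
##  |-- Timestamp: string (nullable = true)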

One is to mutate the column to a timestamp, as in normal R with dplyr and lubridate:

df %>%
  mutate(Timestamp = mdy_hms(Timestamp)) %>%
  filter(Timestamp < mdy_hms('03/01/2015 00:00:00'))

But I failed to mutate the columns of the DataFrame this way, since a SparkR column is an S4 Column object, not a vector.
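
To illustrate why this breaks down (a minimal sketch; the exact error from lubridate may vary):

library(lubridate)

class(df$Timestamp)
## [1] "Column"
## attr(,"package")
## [1] "SparkR"

# lubridate's parsers expect a character vector, so this fails:
mdy_hms(df$Timestamp)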

The second approach might be to register the DataFrame as a table and then use Spark SQL to handle the timestamp type:

df <- createDataFrame(sqlContext, data.frame(ID = c(1,2,3),
                                             Timestamp=c('08/01/2014 11:18:30',
                                                         '01/01/2015 12:13:45',
                                                         '05/01/2015 14:17:33')))
registerTempTable(df, 'df')
head(sql(sqlContext, 'SELECT * FROM df WHERE Timestamp < "03/01/2015 00:00:00"'))

But since this is still a string comparison, it gives the wrong result. What would be the correct way to do this?

1 Answer

Spark 1.6+

You should be able to use the unix_timestamp function with the standard SQLContext:

# Parse the string with an explicit format; unix_timestamp yields seconds
# since the epoch, which we then cast to a proper timestamp column
# (the %>% pipe comes from magrittr)
ts <- unix_timestamp(df$Timestamp, 'MM/dd/yyyy HH:mm:ss') %>%
  cast("timestamp")

df %>%
  where(ts < cast(lit("2015-03-01 00:00:00"), "timestamp"))
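
Checking the result should return only the two rows before the cutoff (a sketch; head() materializes the matching rows as a local data.frame):

df %>%
  where(ts < cast(lit("2015-03-01 00:00:00"), "timestamp")) %>%
  head()
##   ID           Timestamp
## 1  1 08/01/2014 11:18:30
## 2  2 01/01/2015 12:13:45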

Spark < 1.6

This should do the trick:

# unix_timestamp is available here as a Hive UDF, so initialize a HiveContext
sqlContext <- sparkRHive.init(sc)

query <- "SELECT * FROM df
    WHERE unix_timestamp(Timestamp, 'MM/dd/yyyy HH:mm:ss') <
          unix_timestamp('2015-03-01 00:00:00')"  # default format: yyyy-MM-dd HH:mm:ss

# Recreate the DataFrame with the Hive-enabled context
df <- createDataFrame(sqlContext, ...)
registerTempTable(df, 'df')

head(sql(sqlContext, query))

##   ID           Timestamp
## 1  1 08/01/2014 11:18:30
## 2  2 01/01/2015 12:13:45

Note that the choice of context matters here. Since unix_timestamp is a Hive UDF, the standard SQLContext you get by default in SparkR won't work here.
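
Registering a temp table is not strictly required: SparkR's filter() also accepts a SQL expression string, which is resolved against the context the DataFrame was created with. A minimal sketch, assuming df was created with the Hive-enabled sqlContext above:

# Same predicate as the query above, passed directly to filter()
filtered <- filter(df, "unix_timestamp(Timestamp, 'MM/dd/yyyy HH:mm:ss') <
                        unix_timestamp('2015-03-01 00:00:00')")
head(filtered)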

answered 2015-09-06T08:27:30.423