I want to filter the rows of a DataFrame in SparkR by timestamp, with a format like the following:

df <- createDataFrame(sqlContext, data.frame(ID = c(1,2,3),
                                             Timestamp=c('08/01/2014 11:18:30',
                                                         '01/01/2015 12:13:45',
                                                         '05/01/2015 14:17:33')))

Please note that the original schema for the Timestamp column is String. Say I want to filter the timestamps before 03/01/2015 00:00:00; I think there are two possible approaches:
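
A quick sanity check with printSchema (assuming the DataFrame created above) confirms that Timestamp comes through as a plain string:

printSchema(df)
## root
##  |-- ID: double (nullable = true)
##  |-- Timestamp: string (nullable = true)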

One is to mutate the column to a timestamp, as in normal R with dplyr and lubridate:

df %>%
  mutate(Timestamp = mdy_hms(Timestamp)) %>%
  filter(Timestamp < mdy_hms('03/01/2015 00:00:00'))

But I failed to mutate the columns of the DataFrame this way, since a SparkR column is an S4 Column object, not a vector.
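
To illustrate why this breaks down (a minimal sketch; the exact error from lubridate may vary):

library(lubridate)

class(df$Timestamp)
## [1] "Column"
## attr(,"package")
## [1] "SparkR"

# lubridate's parsers expect a character vector, so this fails:
mdy_hms(df$Timestamp)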

The second approach might be to register the DataFrame as a table and then use Spark SQL to handle the timestamp type:

df <- createDataFrame(sqlContext, data.frame(ID = c(1,2,3),
                                             Timestamp=c('08/01/2014 11:18:30',
                                                         '01/01/2015 12:13:45',
                                                         '05/01/2015 14:17:33')))
registerTempTable(df, 'df')
head(sql(sqlContext, 'SELECT * FROM df WHERE Timestamp < "03/01/2015 00:00:00"'))

But since this is still a string comparison, it gives the wrong result. What would be the correct way to do this?

1 Answer

Spark 1.6+

You should be able to use the unix_timestamp function with the standard SQLContext:

# Parse the string with an explicit format; unix_timestamp yields seconds
# since the epoch, which we then cast to a proper timestamp column
# (the %>% pipe comes from magrittr)
ts <- unix_timestamp(df$Timestamp, 'MM/dd/yyyy HH:mm:ss') %>%
  cast("timestamp")

df %>%
  where(ts < cast(lit("2015-03-01 00:00:00"), "timestamp"))
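
Checking the result should return only the two rows before the cutoff (a sketch; head() materializes the matching rows as a local data.frame):

df %>%
  where(ts < cast(lit("2015-03-01 00:00:00"), "timestamp")) %>%
  head()
##   ID           Timestamp
## 1  1 08/01/2014 11:18:30
## 2  2 01/01/2015 12:13:45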

Spark < 1.6

This should do the trick:

# unix_timestamp is available here as a Hive UDF, so initialize a HiveContext
sqlContext <- sparkRHive.init(sc)

query <- "SELECT * FROM df
    WHERE unix_timestamp(Timestamp, 'MM/dd/yyyy HH:mm:ss') <
          unix_timestamp('2015-03-01 00:00:00')"  # default format: yyyy-MM-dd HH:mm:ss

# Recreate the DataFrame with the Hive-enabled context
df <- createDataFrame(sqlContext, ...)
registerTempTable(df, 'df')

head(sql(sqlContext, query))

##   ID           Timestamp
## 1  1 08/01/2014 11:18:30
## 2  2 01/01/2015 12:13:45

Note that the choice of context matters here. Since unix_timestamp is a Hive UDF, the standard SQLContext you get by default in SparkR won't work here.
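
Registering a temp table is not strictly required: SparkR's filter() also accepts a SQL expression string, which is resolved against the context the DataFrame was created with. A minimal sketch, assuming df was created with the Hive-enabled sqlContext above:

# Same predicate as the query above, passed directly to filter()
filtered <- filter(df, "unix_timestamp(Timestamp, 'MM/dd/yyyy HH:mm:ss') <
                        unix_timestamp('2015-03-01 00:00:00')")
head(filtered)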

answered 2015-09-06T08:27:30.423