I want to filter rows of a DataFrame in SparkR by timestamp, with data in a format like the following:
df <- createDataFrame(sqlContext, data.frame(ID = c(1, 2, 3),
                                             Timestamp = c('08/01/2014 11:18:30',
                                                           '01/01/2015 12:13:45',
                                                           '05/01/2015 14:17:33')))
Please note that the original schema of the Timestamp column is String. Say I want to filter out those timestamps before 03/01/2015 00:00:00. I think there might be two approaches to do this:
One is to mutate the column to a timestamp type, as in normal R with dplyr and lubridate:
df %>%
  mutate(Timestamp = mdy_hms(Timestamp)) %>%
  filter(Timestamp < mdy_hms('03/01/2015 00:00:00'))
But I failed to mutate columns of the DataFrame, since each column is an S4 Column object rather than a vector.
The second approach might be to register the DataFrame as a table and then use Spark SQL to deal with the timestamp type:
df <- createDataFrame(sqlContext, data.frame(ID = c(1, 2, 3),
                                             Timestamp = c('08/01/2014 11:18:30',
                                                           '01/01/2015 12:13:45',
                                                           '05/01/2015 14:17:33')))
registerTempTable(df, 'df')
head(sql(sqlContext, 'SELECT * FROM df WHERE Timestamp < "03/01/2015 00:00:00"'))
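As far as I can tell, the comparison above is lexicographic on the raw strings, and even plain R shows how that goes wrong for these values:

```r
# Character-by-character comparison: '8' > '3' at the second character,
# so the August 2014 row compares as "later" than March 2015 and gets
# excluded, even though it is chronologically earlier.
'08/01/2014 11:18:30' < '03/01/2015 00:00:00'
# [1] FALSE
```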
But since this is still a string comparison, it would give the wrong result. What would be the correct way to do this?
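For reference, the kind of thing I imagine might work, though I have not been able to verify it, is to parse the strings into epoch seconds on both sides of the comparison. This sketch assumes that `unix_timestamp(value, pattern)` is available through my `sqlContext`, which I am not certain of:

```r
registerTempTable(df, 'df')
# Untested sketch: compare epoch seconds instead of raw strings by
# parsing both sides with unix_timestamp(value, pattern).
head(sql(sqlContext,
         "SELECT * FROM df
          WHERE unix_timestamp(Timestamp, 'MM/dd/yyyy HH:mm:ss')
              < unix_timestamp('03/01/2015 00:00:00', 'MM/dd/yyyy HH:mm:ss')"))
```

If something like this is the right direction, I would also like to know whether it can be written with the DataFrame API (e.g. `cast`) instead of raw SQL.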