I want to filter rows of a DataFrame in SparkR by timestamp, with a format like the following:
df <- createDataFrame(sqlContext, data.frame(ID = c(1,2,3),
Timestamp=c('08/01/2014 11:18:30',
'01/01/2015 12:13:45',
'05/01/2015 14:17:33')))
Please note that the original schema for the Timestamp column is String. Say I want to filter the rows with timestamps before 03/01/2015 00:00:00; I think there might be two approaches to do this:
One is to mutate the column to a timestamp type, as in normal R with dplyr and lubridate:
df %>%
mutate(Timestamp = mdy_hms(Timestamp)) %>%
filter(Timestamp < mdy_hms('03/01/2015 00:00:00'))
But I failed to mutate columns of the DataFrame, since each column is an S4 Column object, not a vector.
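For reference, what I expected the SparkR-native version to look like is something like this (just a sketch of my mental model; I'm assuming `cast` and `withColumn` on an S4 Column can be combined this way, and I'm not sure a plain cast can even parse the MM/dd/yyyy format):

```r
# Sketch only, untested: try to cast the string column to a timestamp type.
# Assumption: cast() accepts "timestamp" as a target type; it may well
# return NULLs here because the strings are not in ISO yyyy-MM-dd format.
df2 <- withColumn(df, "TimestampTs", cast(df$Timestamp, "timestamp"))
filtered <- filter(df2, df2$TimestampTs < "2015-03-01 00:00:00")
```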
The second approach might be to register the DataFrame as a table and then use Spark SQL to deal with the timestamp type:
df <- createDataFrame(sqlContext, data.frame(ID = c(1,2,3),
Timestamp=c('08/01/2014 11:18:30',
'01/01/2015 12:13:45',
'05/01/2015 14:17:33')))
registerTempTable(df, 'df')
head(sql(sqlContext, 'SELECT * FROM df WHERE Timestamp < "03/01/2015 00:00:00"'))
But since this is still a string comparison, it gives the wrong result. What would be the correct way to do this?
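One variation I have considered, but not verified, is converting both sides inside the query with `unix_timestamp` (assuming Hive's two-argument `unix_timestamp(str, pattern)` is available through my sqlContext):

```r
# Sketch only, untested: compare epoch seconds instead of raw strings.
head(sql(sqlContext,
  "SELECT * FROM df
   WHERE unix_timestamp(Timestamp, 'MM/dd/yyyy HH:mm:ss') <
         unix_timestamp('03/01/2015 00:00:00', 'MM/dd/yyyy HH:mm:ss')"))
```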