I have log files going into different directories based on the creation date of each log file.
For example:
> /mypath/2017/01/20/...
> ...
> /mypath/2017/02/13/...
> /mypath/2017/02/14/...
I would like to combine all of these log files into one single RDD/DataFrame using PySpark so that I can run aggregations over this master dataset.
So far, I have been reading individual directories with sqlContext and using union to join the log files for specific dates:
DF1 = (sqlContext.read.schema(schema).json("/mypath/2017/02/13")).union(sqlContext.read.schema(schema).json("/mypath/2017/02/14"))
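Here, sqlContext and schema come from a setup roughly like this (the schema fields below are just placeholders for illustration; my real log schema is different):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

sc = SparkContext(appName="log-aggregation")
sqlContext = SQLContext(sc)

# Placeholder schema -- my real log schema has more fields.
schema = StructType([
    StructField("timestamp", StringType(), True),
    StructField("message", StringType(), True),
])

This works, but chaining one union per day gets unwieldy for a longer range of dates.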
Is there an easy way to get the master RDD/DataFrame by specifying a range of dates for the log files (i.e. from 2017/01/20 to 2017/02/14)?
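Something like the following is roughly what I have in mind (just a sketch, using the same sqlContext and schema as above; I am not sure whether .json() accepts a list of paths like this, and it assumes a directory exists for every date in the range):

from datetime import date, timedelta

# Build the per-day directory names for the date range and read them in one go.
start, end = date(2017, 1, 20), date(2017, 2, 14)
paths = ["/mypath/{:%Y/%m/%d}".format(start + timedelta(days=i))
         for i in range((end - start).days + 1)]

master_df = sqlContext.read.schema(schema).json(paths)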
I am quite new to Spark, so please correct me if I am wrong at any step.