I have a bunch of tuples which are in the form of composite keys and values. For example,
tfile.collect() = [(('id1','pd1','t1'), 5.0),
                   (('id2','pd2','t2'), 6.0),
                   (('id1','pd1','t2'), 7.5),
                   (('id1','pd1','t3'), 8.1)]
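For reference, the same collection can be reproduced locally with sc.parallelize (this assumes an existing SparkContext sc and is only meant for experimenting):

# Minimal sample RDD with the same composite-key structure, for experimenting.
sample = [(('id1','pd1','t1'), 5.0),
          (('id2','pd2','t2'), 6.0),
          (('id1','pd1','t2'), 7.5),
          (('id1','pd1','t3'), 8.1)]
sample_rdd = sc.parallelize(sample)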
I want to perform SQL-like operations on this collection, so that I can aggregate the information based on id[1..n] or pd[1..n]. I want to implement this with the vanilla PySpark APIs rather than SQLContext. In my current implementation I am reading from a bunch of files and merging the RDDs:
def readfile():
    fr = range(6, 23)
    # For each file: parse each line with set_feature into (composite key, value)
    # pairs, sum the values per key, then merge all the per-file RDDs.
    # f=f binds the current file number at lambda-definition time; otherwise
    # every lambda would see the last value of f when the job is submitted.
    tfile = sc.union([sc.textFile(basepath + str(f) + ".txt")
                          .map(lambda view, f=f: set_feature(view, f))
                          .reduceByKey(lambda a, b: a + b)
                      for f in fr])
    return tfile
I intend to create an aggregated array as a value. For example,
agg_tfile = [(('id1','pd1'), [5.0, 7.5, 8.1])]
where 5.0, 7.5, 8.1 represent the values for [t1, t2, t3]. I am currently achieving this with vanilla Python code using dictionaries. It works fine for smaller data sets, but I worry that it may not scale to larger ones. Is there an efficient way to achieve the same with the PySpark APIs?
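Roughly, my current driver-side approach looks like the sketch below (simplified; it just collects everything and groups the values by the (id, pd) prefix of the composite key):

# Simplified sketch of the current dictionary-based approach.
from collections import defaultdict

def aggregate_locally(tfile):
    agg = defaultdict(list)
    # Group values on the driver by the (id, pd) prefix of the composite key.
    for (id_, pd_, t), value in tfile.collect():
        agg[(id_, pd_)].append(value)
    return list(agg.items())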