I have the following table as an RDD:
Key Value
1 y
1 y
1 y
1 n
1 n
2 y
2 n
2 n
I want to remove all the duplicates from Value.
Output should come like this:
Key Value
1 y
1 n
2 y
2 n
While working in pyspark, the output should come as a list of key-value pairs like this:
[(u'1',u'y'),(u'1',u'n'),(u'2',u'y'),(u'2',u'n')]
I don't know how to apply a for loop here. In a normal Python program it would have been very easy. I wonder if there is some function in pyspark for the same.
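For reference, a loop-free sketch of the deduplication in plain Python is below (the sample data is taken from the table above). In PySpark, the analogous operation would be `RDD.distinct()`, e.g. `rdd.distinct().collect()`, which deduplicates whole `(key, value)` tuples across the RDD.

```python
# Sample data mirroring the table above, as a list of (key, value) pairs.
pairs = [
    (u'1', u'y'), (u'1', u'y'), (u'1', u'y'), (u'1', u'n'), (u'1', u'n'),
    (u'2', u'y'), (u'2', u'n'), (u'2', u'n'),
]

# dict.fromkeys drops duplicate tuples while preserving first-seen order,
# so no explicit for loop is needed.
unique_pairs = list(dict.fromkeys(pairs))
print(unique_pairs)
# [('1', 'y'), ('1', 'n'), ('2', 'y'), ('2', 'n')]
```

With an actual RDD, `sc.parallelize(pairs).distinct().collect()` should produce the same four pairs, though the ordering of the collected result is not guaranteed.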