I'm having a hard time implementing something that seems like it should be very easy:
My goal is to translate values in an RDD/DataFrame using a second RDD/DataFrame as a lookup table (a translation dictionary), and I need to apply these translations across multiple columns.
The easiest way to explain the problem is by example. Let's say I have as my input the following two RDDs:
Route  SourceCityID  DestinationCityID
A      1             2
B      1             3
C      2             1
and
CityID  CityName
1       London
2       Paris
3       Tokyo
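
For concreteness, here's roughly how I'd construct these inputs as RDDs of tuples (a sketch; the variable names are my own):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("routes").setMaster("local[*]"))

// (Route, SourceCityID, DestinationCityID)
val routes = sc.parallelize(Seq(
  ("A", 1, 2),
  ("B", 1, 3),
  ("C", 2, 1)
))

// (CityID, CityName)
val cities = sc.parallelize(Seq(
  (1, "London"),
  (2, "Paris"),
  (3, "Tokyo")
))
```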
My desired output RDD is:
Route  SourceCity  DestinationCity
A      London      Paris
B      London      Tokyo
C      Paris       London
How should I go about producing it?
This is an easy problem in SQL, but I don't know of an obvious solution with RDDs in Spark. The join, cogroup, etc. methods don't seem well suited to multi-column RDDs, and they don't let you specify which column to join on.
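
The closest I've come is re-keying the routes RDD once per column and joining twice (a sketch building on the RDDs above), which works but feels clumsy and wouldn't scale nicely to more columns:

```scala
// Key routes by source city ID and join to pick up the source city name,
// then re-key by destination city ID and join again for the destination name.
val translated = routes
  .map { case (route, src, dst) => (src, (route, dst)) }
  .join(cities)
  .map { case (_, ((route, dst), srcName)) => (dst, (route, srcName)) }
  .join(cities)
  .map { case (_, ((route, srcName), dstName)) => (route, srcName, dstName) }
```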
Any ideas? Is SQLContext the answer?
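
For example, is it something along these lines? (A sketch assuming Spark 1.3+; I'm not sure this is the intended approach.)

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val routesDF = routes.toDF("Route", "SourceCityID", "DestinationCityID")
val citiesDF = cities.toDF("CityID", "CityName")
routesDF.registerTempTable("routes")
citiesDF.registerTempTable("cities")

// Join the cities table twice, once per ID column.
val result = sqlContext.sql("""
  SELECT r.Route, src.CityName AS SourceCity, dst.CityName AS DestinationCity
  FROM routes r
  JOIN cities src ON r.SourceCityID = src.CityID
  JOIN cities dst ON r.DestinationCityID = dst.CityID
""")
```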