Assuming data frame 1 represents target country and list of source countries and data frame 2 represents the availability for all the countries, find all the pairs from data frame 1 where target country mapping is TRUE and source country mapping is FALSE:
Dataframe 1 (targetId, sourceId):
USA: China, Russia, India, Japan
China: USA, Russia, India
Russia: USA, Japan
Dataframe 2 (id, available):
USA: true
China: false
Russia: true
India: false
Japan: true
Result Dataset should look like:
(USA, China),
(USA, India)
My idea is to first explode the data set1, create new data frame (say, tempDF), add 2 new columns to it: targetAvailable, sourceAvailable and finally filter for targetAvailable = false and sourceAvailable = true to get the desired result data frame.
Below is the snippet of my code:
val sourceDF = sourceData.toDF("targetId", "sourceId")
val mappingDF = mappingData.toDF("id", "available")
val tempDF = sourceDF.select(col("targetId"),
explode(col("sourceId")).as("source_id_split"))
val resultDF = tempDF.select("targetId")
.withColumn("targetAvailable", isAvailable(tempDF.col("targetId")))
.withColumn("sourceAvailable", isAvailable(tempDF.col("source_id_split")))
/*resultDF.select("targetId", "sourceId").
filter(col("targetAvailable") === "true" and col("sourceAvailable")
=== "false").show()*/
// udf to find the availability value for the given id from the mapping table
val isAvailable = udf((searchId: String) => {
val rows = mappingDF.select("available")
.filter(col("id") === searchId).collect()
if (rows(0)(0).toString.equals("true")) "true" else "false" })
Calling isAvailable
UDF while calculating the resultDF
throws me some weird exception. Am I doing something wrong? is there a better / simpler way to do this?