First, the imaginary use case. Let's say I have a stream of tuples (user_id, time_stamp, login_ip). I want to maintain the last login IP of each user at 5-second granularity.
Using Spark Streaming, I can use the updateStateByKey method to maintain this map. The problem is that as data keeps streaming in, the state RDD of each batch interval grows larger and larger, because more and more distinct user_ids are seen. After some time, the map becomes so large that maintaining it takes too long, and real-time delivery of the results can no longer be achieved.
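For concreteness, here is a minimal sketch of what I am doing (assuming the tuples arrive as comma-separated "user_id,time_stamp,login_ip" lines on a socket; the host, port, and checkpoint path are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LastLoginIp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LastLoginIp")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batches
    ssc.checkpoint("/tmp/last-login-checkpoint")     // required by updateStateByKey

    // Parse each line into (user_id, (time_stamp, login_ip)).
    val logins = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(userId, ts, ip) = line.split(",")
      (userId, (ts.toLong, ip))
    }

    // Keep only the most recent (time_stamp, login_ip) per user. The state
    // grows by one entry for every new user_id ever seen, which is the problem.
    val lastIp = logins.updateStateByKey[(Long, String)] {
      (newLogins: Seq[(Long, String)], state: Option[(Long, String)]) =>
        (newLogins ++ state).reduceOption((a, b) => if (a._1 >= b._1) a else b)
    }

    lastIp.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```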
Note that this is just a simple example I came up with to illustrate the problem. Real problems could be more complicated and genuinely need real-time delivery.
Any ideas on how to solve this problem? Solutions in Spark as well as in other systems are all welcome.