
First, the imaginary use case. Let's say I have a stream of tuples (user_id, time_stamp, login_ip). I want to maintain the last login IP of each user at 5-second granularity.

Using Spark Streaming, I can use the updateStateByKey method to update this map. The problem is that as data keeps streaming in, the state RDD of each time interval grows larger and larger because more user_ids are seen. After some time, the map becomes so large that maintaining it takes longer, so real-time delivery of the result can no longer be achieved.
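A minimal sketch of what I mean (the socket source, field layout, and checkpoint path are all illustrative assumptions, not my actual setup):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LastLoginIp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LastLoginIp")
    val ssc  = new StreamingContext(conf, Seconds(5)) // 5-second batches
    ssc.checkpoint("/tmp/last-login-checkpoint")      // required by updateStateByKey

    // Hypothetical source: lines of "user_id,time_stamp,login_ip"
    val events = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(f => (f(0), (f(1).toLong, f(2))))          // (user_id, (time_stamp, login_ip))

    // For each user, keep whichever (time_stamp, login_ip) pair is newest.
    val lastLogin = events.updateStateByKey[(Long, String)] {
      (newEvents: Seq[(Long, String)], state: Option[(Long, String)]) =>
        (newEvents ++ state.toSeq).reduceOption((a, b) => if (a._1 >= b._1) a else b)
    }

    lastLogin.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```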

Note that this is just a simple example that I came up with to show the problem. Real problems could be more complicated and genuinely need real-time delivery.

Any ideas on how to solve this problem? (Solutions in Spark as well as in other systems are all welcome.)


1 Answer


You are not updating the entire map. The function you supply only updates the state associated with one key; Spark does the rest. In particular, it maintains the map-like RDD of key-state pairs for you (actually, a series of them, a DStream). So the storage and updating of state is distributed just like everything else. If updates aren't fast enough, you can scale out by adding more workers.
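To make the per-key contract concrete, here is a sketch continuing the example from the question (the partition count of 32 is an illustrative assumption):

```scala
// The update function sees one key at a time: that key's new values in this
// batch, plus that key's previous state. It never touches the full map.
val update = (newValues: Seq[(Long, String)], last: Option[(Long, String)]) =>
  (newValues ++ last.toSeq).reduceOption((a, b) => if (a._1 >= b._1) a else b)

// The state RDD is partitioned by key; more partitions let more executor
// cores (and hence more workers) update disjoint slices of the state in parallel.
val lastLogin = events.updateStateByKey(update, 32)
```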

answered 2014-07-07T21:12:02.093