java - Scalable patterns for thread-safe hashtable puts when keeping track of frequency

Question

This was an interview question I got last week, and it ended on a cliffhanger. The question was simple: design a service that keeps track of the frequency of "messages" (a one-line string, possibly in different languages) passed to it. There are two broad APIs: submitMsg(String msg) and getFrequency(String msg). My immediate reaction was to use a HashMap with a String as the key (in this case, a message) and an Integer as the value (to keep track of the count/frequency).

The submitMsg api simply sees whether a message exists in the hashMap. If it doesn't, put the message and set the frequency to 1; if it does, then get the current count and increment it by 1. The interviewer then pointed out this would fail miserably in the event multiple threads access the SAME key at the SAME exact time.

For example: at 12:00:00:000 Thread1 calls submitMsg, and my method would (1) do a get on the HashMap and see that the value is not null but is in fact, say, 100, and then (2) do a put that increments the frequency by 1 so that the key's value is 101. Meanwhile, suppose Thread2 ALSO calls submitMsg at exactly 12:00:00:000. The method once again does a get on the HashMap, which also returns 100 (this is the race condition), and then writes 101. Alas, the true frequency should have been 102 and not 101, and this is a major design flaw in a heavily multithreaded environment.

I wasn't sure how to stop this from happening: putting a lock on just the write isn't good enough, and having a lock on a read didn't make sense. What would have been ideal is to "lock" an element when a get is invoked internally via the submitMsg API, because we expect it to be written to soon after. The lock would be released once the frequency had been updated. But if someone were only calling the getFrequency() API, holding a full lock wouldn't make sense. I'm not sure whether a mutex would help here because I don't have a strong background in distributed systems.
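For reference, the naive design described above can be sketched roughly as follows. The class name NaiveFrequencyCounter is hypothetical; the point is only to show where the read and the write are separate steps, which is what allows two threads to lose an update.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical name; a sketch of the naive, NOT thread-safe design from the question.
class NaiveFrequencyCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    public void submitMsg(String msg) {
        Integer current = counts.get(msg);      // step 1: read (two threads may both see 100)
        if (current == null) {
            counts.put(msg, 1);                 // first occurrence of this message
        } else {
            counts.put(msg, current + 1);       // step 2: write (both threads store 101)
        }
    }

    public int getFrequency(String msg) {
        Integer current = counts.get(msg);
        return current == null ? 0 : current;   // unseen messages have frequency 0
    }
}
```

With a single thread this behaves as expected; the problem only appears when another thread runs between step 1 and step 2 for the same key.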

I'm looking to the SO community for help on the best way to think through a problem like this. Is the magic in the data structure to be used, or in some kind of synchronization that I need to do in my API itself? How can we maintain the integrity of the "frequency" while keeping the service scalable?

score 4 · Accepted Answer

Well, your initial idea wasn't a million miles off; you just need to make it thread-safe. For instance, you could use a ConcurrentHashMap<String, AtomicInteger>.

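// Needs java.util.concurrent.ConcurrentHashMap and java.util.concurrent.atomic.AtomicInteger.
private final ConcurrentHashMap<String, AtomicInteger> map = new ConcurrentHashMap<>();

public void submitMsg(String msg) {
    // Atomically install a counter starting at 1 if the message is new.
    AtomicInteger previous = map.putIfAbsent(msg, new AtomicInteger(1));
    if (null != previous) {
        // The message was already present: bump its existing counter atomically.
        previous.incrementAndGet();
    }
}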
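As a rough illustration, the snippet above can be completed into a small self-contained class. The class name FrequencyService, the getFrequency implementation, and the demo helper are assumptions added here, not part of the original answer; only the submitMsg pattern comes from it.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical class name; submitMsg follows the answer's putIfAbsent pattern.
class FrequencyService {
    private final ConcurrentHashMap<String, AtomicInteger> map = new ConcurrentHashMap<>();

    public void submitMsg(String msg) {
        AtomicInteger previous = map.putIfAbsent(msg, new AtomicInteger(1));
        if (previous != null) {
            previous.incrementAndGet();          // atomic read-modify-write on the counter
        }
    }

    // Assumed read API: a plain lock-free read of the current counter value.
    public int getFrequency(String msg) {
        AtomicInteger counter = map.get(msg);
        return counter == null ? 0 : counter.get();
    }

    // Demo helper (an assumption): several threads hammer the same key.
    static int demo(int threads, int perThread) {
        FrequencyService service = new FrequencyService();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    service.submitMsg("hello");
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();                        // wait for every writer to finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return service.getFrequency("hello");
    }
}
```

Calling FrequencyService.demo(8, 10000) should return 80000, since no increment is lost. Readers never take a lock here, which addresses the concern in the question about getFrequency() blocking behind writers.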

score 2 · Accepted Answer

The simplest solution is to use Guava's com.google.common.collect.ConcurrentHashMultiset:

private final ConcurrentHashMultiset<String> multiset = ConcurrentHashMultiset.create();

public void submitMsg(String msg) {
    multiset.add(msg);
}

public int count(String msg) {
    return multiset.count(msg);
}

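But this is essentially the same as Aurand's solution, except that someone has already implemented the boring details, such as creating the counter when it doesn't exist yet.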

score 1 · Accepted Answer

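Think of it as a producer-consumer problem.

The service is the producer; it should add each message to a queue that feeds the consumer. You can run one queue per producer to ensure that producers never wait.

The consumer encapsulates the HashTable, pulls messages off the queue, and updates the table.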
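A minimal sketch of this idea, under several assumptions: the class name QueuedFrequencyService, the start/stop methods, and the STOP sentinel are all hypothetical. It uses a single shared queue for brevity rather than one queue per producer, and a ConcurrentHashMap instead of a Hashtable so that getFrequency can safely read while the lone consumer thread writes.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical names throughout; one consumer thread owns all updates to the table.
class QueuedFrequencyService {
    // Sentinel compared by identity, so it can never collide with a real message.
    private static final String STOP = new String("STOP");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Map<String, Integer> counts = new ConcurrentHashMap<>();
    private final Thread consumer = new Thread(this::drain);

    void start() {
        consumer.start();
    }

    public void submitMsg(String msg) {
        queue.add(msg);                          // unbounded queue: producers do not block
    }

    public int getFrequency(String msg) {
        return counts.getOrDefault(msg, 0);      // may lag behind messages still queued
    }

    // Consumer loop: the only code path that writes to the table.
    private void drain() {
        try {
            while (true) {
                String msg = queue.take();
                if (msg == STOP) {
                    return;                      // identity check against the sentinel
                }
                counts.merge(msg, 1, Integer::sum);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Enqueue the sentinel and wait until everything before it has been counted.
    void stop() {
        queue.add(STOP);
        try {
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The trade-off in this design is that counts are eventually consistent: a getFrequency call made right after submitMsg may not yet reflect that message, because it is still waiting in the queue.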
