This was an interview question I got last week, and it ended on a cliffhanger. The question was simple: design a service that keeps track of the frequency of "messages" (a one-line string, possibly in different languages) passed to it. There are two broad APIs: submitMsg(String msg) and getFrequency(String msg). My immediate reaction was to use a HashMap with a String as the key (in this case, the message) and an Integer as the value (to keep track of the count/frequency).
The submitMsg API simply checks whether the message already exists in the HashMap. If it doesn't, it puts the message with a frequency of 1; if it does, it gets the current count and puts it back incremented by 1. The interviewer then pointed out that this would fail miserably if multiple threads accessed the SAME key at the SAME exact time.
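To make the approach concrete, here is a minimal sketch of what I described in the interview. The class name NaiveFrequencyService is just something I made up for this post, and returning 0 for an unseen message is my own assumption; it works when a single thread calls it but makes no attempt at thread safety.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical name; a single-threaded sketch of the design described above.
class NaiveFrequencyService {
    private final Map<String, Integer> counts = new HashMap<>();

    // Check-then-act: the get and the put are two separate steps,
    // so nothing stops another thread from running in between them.
    void submitMsg(String msg) {
        Integer current = counts.get(msg);
        if (current == null) {
            counts.put(msg, 1);            // first time this message is seen
        } else {
            counts.put(msg, current + 1);  // read, increment, write back
        }
    }

    // Assumption: an unseen message has frequency 0.
    int getFrequency(String msg) {
        Integer current = counts.get(msg);
        return current == null ? 0 : current;
    }
}
```

The gap between the get and the put inside submitMsg is exactly where the interviewer's objection applies.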
For example: at 12:00:00:000 Thread1 calls submitMsg, so my method would (1) do a get on the HashMap and see that the value is not null but is in fact, say, 100, and then (2) do a put with the frequency incremented by 1, so that the key's value becomes 101. Meanwhile, suppose Thread2 ALSO calls submitMsg at exactly 12:00:00:000. Its get on the HashMap also returns 100 (this is the race condition), and it too writes 101. Alas, the true frequency should have been 102 and not 101, which is a major design flaw in a heavily multithreaded environment.
I wasn't sure how to stop this from happening: putting a lock on just the write isn't good enough, and putting a lock on the read didn't make sense. What would have been ideal is to "lock" an element whenever a get is invoked internally by submitMsg, because we expect that element to be written to soon after. The lock would be released once the frequency had been updated. But if someone were only calling getFrequency(), taking a full lock wouldn't make sense. I'm not sure whether a mutex would help here, because I don't have a strong background in concurrency or distributed systems.
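For reference, the per-key "lock, update, release" behaviour I was groping for might be sketched roughly as below. This assumes Java's java.util.concurrent.ConcurrentHashMap, whose merge method is documented to perform the whole read-modify-write atomically for a given key. The class name ConcurrentFrequencyService is only illustrative, returning 0 for an unseen message is my assumption, and I'm not claiming this is the best or the only way to do it.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical name; a sketch of an atomic per-key increment, not a definitive design.
class ConcurrentFrequencyService {
    private final ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();

    // merge inserts 1 if the key is absent, otherwise applies Integer::sum
    // to the existing value and 1. The map performs this as one atomic step
    // per key, so two threads cannot both read 100 and both write 101.
    void submitMsg(String msg) {
        counts.merge(msg, 1, Integer::sum);
    }

    // A plain read; it takes no lock and returns the most recently completed count.
    // Assumption: an unseen message has frequency 0.
    int getFrequency(String msg) {
        return counts.getOrDefault(msg, 0);
    }
}
```

With this shape the synchronization lives in the data structure rather than in my API methods, and updates to different messages should not have to wait on each other the way they would behind a single global lock. Whether that is enough for a service spread across several machines is a separate question that this sketch does not address.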
I'm looking to the SO community for help on the best way to think through a problem like this. Is the magic in the data structure to be used, or in some kind of synchronization that I need to do in my API itself? How can we maintain the integrity of the "frequency" while also maintaining the scalability of the service?