Partition your keys among your threads, e.g. with N threads you would give T1 the keys {0, key.size() / N - 1}, T2 gets the keys {key.size() / N, 2 * key.size() / N - 1}, etc., and TN gets the keys {(N - 1) / N * keys.size(), keys.size() - 1}. Each thread puts its results in a thread-local container, and you merge the containers when the threads have finished. This way you don't need to perform any synchronization between the threads.
The most efficient way to merge the containers is for the containers to be linked lists, since it's easy to append T2's list to T1's list and so on. However, as you said it's a good idea to avoid linked lists since they don't parallelize well.
Another option is to have each thread store its results in a thead-local array, and to then merge these arrays when the threads have completed; you can perform this merge in parallel (the size of each thread's results is T{N}results.size(); given the final merge array final_results
, T1 merges its data to final_results[0, T1results.size()-1]
, T2 merges its data to final_results[T1results.size(), T1results.size() + T2results.size()-1]
, T3 merges its results to final_results[T1results.size() + T2results.size(), T1results.size() + T2results.size + T3results.size()-1]
, etc).
Another option is to use a shared concurrent_hash_map from TBB, with key
as the key and data[key]
as the value.