To start with, 16 words to search for does not mean 16 different threads. Every thread brings its own costs, such as synchronization overhead, unequal work distribution, and spawn/teardown time, and if these are not managed carefully they can completely cancel out the benefits of parallelism.
Parallelism pays off most when there are no dependencies within the data being worked on. Your job is data-independent in either of the solutions you describe: looking for some characters within a large pool of characters does not depend on how those characters are arranged (e.g. looking for "++" after every "C" vs. simply looking for "++").
The question now is how to make the best use of memory bandwidth, and how to limit the losses caused by branch misprediction.
"Common data, different threads" works very well in memory-bound situations. The first thread to touch the data loads it into cache, and the other threads then simply reuse that cached chunk, so the data is loaded from memory only once. Splitting the data also loads everything only once: each thread loads its own slice, processes it, and is done with it. So the memory traffic is very similar in both cases.
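As a concrete illustration of the split-data approach, here is a minimal sketch (the names `count_in_range` and `parallel_count` are mine, not from your code): each thread scans its own slice with a private counter, so the only synchronization is the final join. Matches are attributed to the chunk where they start, so a match straddling a chunk boundary is still counted exactly once.

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Count occurrences of `needle` whose start position lies in
// [begin, end) of `hay`. A match may extend past `end`; counting by
// start position ensures no match is counted twice across chunks.
static std::size_t count_in_range(const std::string& hay,
                                  const std::string& needle,
                                  std::size_t begin, std::size_t end) {
    std::size_t count = 0;
    for (std::size_t pos = hay.find(needle, begin);
         pos != std::string::npos && pos < end;
         pos = hay.find(needle, pos + 1)) {
        ++count;
    }
    return count;
}

// Split-data search: one slice per thread, one private counter per
// thread, summed only after all workers have joined.
std::size_t parallel_count(const std::string& hay,
                           const std::string& needle,
                           unsigned num_threads) {
    std::vector<std::size_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = hay.size() / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end =
            (t + 1 == num_threads) ? hay.size() : begin + chunk;
        workers.emplace_back([&, begin, end, t] {
            partial[t] = count_in_range(hay, needle, begin, end);
        });
    }
    for (auto& w : workers) w.join();
    std::size_t total = 0;
    for (std::size_t p : partial) total += p;
    return total;
}
```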
Your code is what might be called "logically bound". Branch prediction is an important aspect to take care of, but a tricky one, and the common-data approach has a higher probability of defeating it. Say you have to find one red ball among a million balls. The predictor will settle on the natural choice of "not present". But if that prediction fails within the first 100 balls, everything speculatively processed after the mispredicted branch is thrown away and redone, and you are back to square one. With split data, the worst-case cost of such a redo is not as bad as with common data, though that cannot be guaranteed.
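One way to sidestep the data-dependent branch entirely, at least for the inner matching loop, is to accumulate the comparison result as an integer instead of branching on it. With no taken/not-taken decision in the loop body, there is nothing for the predictor to miss. A minimal sketch (the function name is illustrative; many compilers generate similar branchless code on their own):

```cpp
#include <cstddef>
#include <string>

// Branchless counting: `c == target` yields 0 or 1, which is added
// directly to the counter instead of guarding an `if`.
std::size_t count_char_branchless(const std::string& data, char target) {
    std::size_t count = 0;
    for (char c : data) {
        count += static_cast<std::size_t>(c == target);
    }
    return count;
}
```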
Choosing between the two approaches therefore depends entirely on the architecture at your disposal.
SMT / logical cores vs. physical cores: using either approach on SMT threads will perform worse than the same number of physical cores. Your idea of one thread splitting into sub-threads is essentially the basis of SMT; each additional logical thread adds contention for the core's resources, on top of synchronization overhead and further chances of branch misprediction. When scaling across physical cores instead, branch prediction becomes the limiting factor.
Split data or common data, your code will not scale indefinitely. The best thing to do is to use the minimum number of physical cores that saturates performance. Common data brings less complexity and should be the easier one to implement. That minimum thread count can be found by experiment: performance stays roughly constant beyond it.