I need to process one billion records periodically. The number of unique keys can be in the range of 10 million. Each value is a string of up to 200K characters.
Here are my questions:
Is a key space of 10 million unique keys considered very large? Would Hadoop be able to handle such a large key space? As I understand it, there will be one reducer per key, so there will be millions of reducers.
I want to update the DB in the reducer itself. In the reducer, I will merge the incoming values (call this the current value), read the existing value for the key from the DB (call this the existing value), merge the current and existing values, and write the result back to the DB (see the sketch below). Is this the right strategy?
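To make question 2 concrete, here is a rough sketch of the reducer I have in mind. The DbClient interface and the merge-by-concatenation logic are just placeholders for whatever DB client and merge rule I actually end up using:

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Rough sketch of the merge-then-update reducer I have in mind.
// DbClient is a placeholder for whatever DB client I end up using.
public class MergeUpdateReducer extends Reducer<Text, Text, Text, Text> {

    // Placeholder interface; the real client would come from my DB's Java driver.
    interface DbClient {
        String get(String key);
        void put(String key, String value);
        void close();
    }

    private DbClient db;

    @Override
    protected void setup(Context context) {
        // Open one DB connection per reduce task, not per key.
        // (Connection details omitted; depends on the DB I pick.)
        db = null; // db = <obtain client from the DB driver>
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // 1. Merge the values seen for this key in this run ("current value").
        //    Plain concatenation here just to show the shape of the logic.
        StringBuilder current = new StringBuilder();
        for (Text value : values) {
            current.append(value.toString());
        }

        // 2. Read the "existing value" from the DB, merge, and write back.
        String existing = db.get(key.toString());   // null if the key is new
        String merged = (existing == null)
                ? current.toString()
                : existing + current.toString();
        db.put(key.toString(), merged);

        // 3. Optionally also emit the merged value as job output.
        context.write(key, new Text(merged));
    }

    @Override
    protected void cleanup(Context context) {
        if (db != null) {
            db.close();
        }
    }
}
```

So each reduce() call would do one read and one write against the DB for its key.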
How many reducers can run simultaneously per box, and is this configurable (my current understanding is sketched below)? If only a single reducer runs per box at a time, that will be a problem, because I won't be able to update the per-key state in the DB fast enough.
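For question 3, my current understanding is that the total reducer count for the job is set in the driver, while per-box concurrency is a cluster-side setting. A sketch of what I mean (400 is just an example value, and the property name in the comment is my guess for classic MRv1):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MergeJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "merge-and-update");

        // Total reduce tasks for the whole job; 400 is an arbitrary example value.
        job.setNumReduceTasks(400);

        // How many of these run at once on each box is, as far as I know, a
        // cluster-side setting, e.g. mapred.tasktracker.reduce.tasks.maximum in
        // mapred-site.xml on classic (MRv1) clusters. Is that the right knob,
        // and does it still apply on newer versions?

        // ... rest of job setup (mapper, reducer, input/output paths) omitted
    }
}
```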
I want the job to complete in 2-3 hours. How many boxes would I need? I can spare at most 50 boxes (64 GB RAM, 8-core machines).
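For reference, a back-of-the-envelope check on my own numbers (assuming a 3-hour window and all 50 boxes in use): 1,000,000,000 records / (3 × 3600 s) ≈ 93,000 records/s cluster-wide, i.e. roughly 1,850 records/s per box.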
Thanks