I have a use case in which I use Hadoop Streaming to run an executable as the map process. On the input side, I have a large number of sequence files. Each seq file has, say, 8 keys and corresponding values, which are lists of float arrays. Rather than letting one map process handle one seq file, I prefer to allocate a group of seq files to each map process. Hence, I decided to merge all those seq files into one large file. Assume this big seq file is made up of 50,000 small seq files.

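For reference, the merge itself seems straightforward with the SequenceFile API. Here is a minimal Groovy sketch of what I have in mind (the paths are placeholders, and the layout of key = original file name as Text, value = raw file bytes as BytesWritable is my assumption):

```groovy
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.io.SequenceFile
import org.apache.hadoop.io.Text

def conf = new Configuration()
def fs = FileSystem.get(conf)

// One big sequence file: key = original file name, value = that file's raw bytes
def writer = SequenceFile.createWriter(fs, conf,
        new Path('/data/merged.seq'), Text, BytesWritable)
try {
    fs.listStatus(new Path('/data/small-seq-files')).each { status ->
        def bytes = new byte[(int) status.len]
        def input = fs.open(status.path)
        try {
            input.readFully(bytes)
        } finally {
            input.close()
        }
        writer.append(new Text(status.path.name), new BytesWritable(bytes))
    }
} finally {
    writer.close()
}
```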
  1. Now, is it possible to configure my Hadoop Streaming job to allocate a portion of the merged seq file to each map process? (See the command sketch after this list.)

  2. How does each map process get the list of file names it needs to process, and how can I retrieve that information in my map executable? The executable is a plain Groovy script designed to process stdin. In that case, what will my stdin look like (how do I determine the key/value pairs, and what will their contents be)? Or, since the merged sequence files have become one big file and lost their individual identities, can I no longer recover their filenames, leaving me to work with the combined set of key/value pairs instead? (See the mapper sketch after this list.)

  3. I think this big seq file will have key/value pairs where the key is a filename and the value is the contents of that file, which in turn contains 8 keys and their corresponding values. If so, when Hadoop splits this big file according to the number of maps available (say 10 maps in my cluster), each map would get around 5,000 keys and their corresponding values. In my map executable, how can I then access these keys and values?
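
For item 1, my understanding is that a sequence file is splittable at its sync markers, so Hadoop should be able to hand each map a portion of the merged file. The invocation I have in mind looks roughly like this (the streaming jar path varies by installation, and treating mapred.map.tasks as a split hint is my assumption):

```
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.map.tasks=10 \
    -inputformat SequenceFileAsTextInputFormat \
    -input /data/merged.seq \
    -output /data/out \
    -mapper mapper.groovy \
    -file mapper.groovy
```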

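For items 2 and 3, if SequenceFileAsTextInputFormat is used, I believe each record reaches the mapper on stdin as one key<TAB>value line, where both halves are the toString() of the original writables. A minimal Groovy mapper sketch, assuming the values render meaningfully as text (parsing the 8 inner key/value pairs is left as a placeholder):

```groovy
#!/usr/bin/env groovy
// Streaming feeds the mapper one record per line on stdin:
// key <TAB> value, as produced by SequenceFileAsTextInputFormat.
System.in.eachLine { line ->
    def parts = line.split('\t', 2)
    def fileName = parts[0]                           // key: original small-file name
    def contents = parts.size() > 1 ? parts[1] : ''   // value: that file's payload as text

    // ... parse the 8 inner keys and their float-array values out of 'contents' here ...

    // Emit output records as key <TAB> value on stdout.
    println "${fileName}\tprocessed"
}
```
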
Any hint will be greatly appreciated.
