I have a few tens of full-sky maps in binary format (FITS), about 600 MB each.
For each sky map I already have a catalog of the positions of a few thousand sources, i.e. stars, galaxies, radio sources.
For each source I would like to:
- open the full sky map
- extract the relevant section, typically 20 MB or less
- run some statistics on it (a rough sketch follows this list)
- aggregate the outputs to a catalog
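The per-source step I have in mind looks roughly like this — a minimal sketch assuming the maps are plain image HDUs with a WCS (if they are HEALPix I would need healpy instead), and with the statistics and cutout size as placeholders for my real analysis:

```python
import numpy as np
import astropy.units as u
from astropy.io import fits
from astropy.wcs import WCS
from astropy.coordinates import SkyCoord
from astropy.nddata import Cutout2D

def process_source(map_path, lon_deg, lat_deg, size_pix=512):
    """Extract a small cutout around one source from a full-sky map
    and return placeholder statistics (mean, std, max)."""
    with fits.open(map_path, memmap=True) as hdul:  # memmap: avoid loading all 600 MB
        wcs = WCS(hdul[0].header)
        position = SkyCoord(lon_deg * u.deg, lat_deg * u.deg)  # frame depends on the map
        cutout = Cutout2D(hdul[0].data, position, (size_pix, size_pix), wcs=wcs)
        patch = np.array(cutout.data, dtype=np.float64)  # copy before the file closes
    return float(np.nanmean(patch)), float(np.nanstd(patch)), float(np.nanmax(patch))
```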
I would like to run Hadoop, possibly using Python via the streaming interface, to process them in parallel.
I think the input to the mapper should be each record of the catalogs; the Python mapper can then open the full sky map, do the processing, and print the output to stdout.
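Concretely, the mapper I picture is something like the sketch below, where the tab-separated record layout (map path, longitude, latitude) is my assumption and `skymap_stats` is a hypothetical module holding the `process_source` routine sketched earlier:

```python
#!/usr/bin/env python
import sys

from skymap_stats import process_source  # hypothetical module with the sketch above

def main():
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        # one catalog record per line: <map path> TAB <lon> TAB <lat>
        map_path, lon, lat = line.split("\t")
        mean, std, peak = process_source(map_path, float(lon), float(lat))
        # Hadoop streaming treats everything up to the first tab as the key
        print("%s,%s\t%f,%f,%f" % (lon, lat, mean, std, peak))

if __name__ == "__main__":
    main()
```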
- Is this a reasonable approach?
- If so, I need to be able to configure Hadoop so that a full sky map is copied locally to the nodes that are processing one of its sources. How can I achieve that? (The only workaround I can think of is sketched after this list.)
- Also, what is the best way to feed the input data to Hadoop? For each source I have a reference to the full sky map, a latitude, and a longitude (a sample of the record format I picture is below).
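For the second question, the only fallback I can come up with is to have the mapper itself pull each map out of HDFS on first use and cache it on the node's local disk, as below (`LOCAL_CACHE` is a hypothetical scratch path; I don't know whether this is idiomatic or whether the distributed cache is meant for files this large):

```python
import os
import subprocess

LOCAL_CACHE = "/tmp/skymaps"  # hypothetical per-node scratch directory

def local_copy(hdfs_path):
    """Copy a full-sky map from HDFS to local disk once per node,
    reusing the cached copy for every later source on the same map."""
    if not os.path.isdir(LOCAL_CACHE):
        os.makedirs(LOCAL_CACHE)
    local_path = os.path.join(LOCAL_CACHE, os.path.basename(hdfs_path))
    if not os.path.exists(local_path):
        # 'hadoop fs -get <src> <localdst>' fetches a file out of HDFS
        subprocess.check_call(["hadoop", "fs", "-get", hdfs_path, local_path])
    return local_path
```

Two mappers landing on the same node could still race to fetch the same map, so I would have to guard against that, which is why I am hoping Hadoop can do the placement and copying for me.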
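To make the last question concrete, this is how I currently picture building the input file, one tab-separated record per source (the paths and coordinates here are made up for illustration):

```python
# hypothetical catalog: HDFS map path -> list of (lon, lat) source positions
catalogs = {
    "hdfs:///skymaps/map_001.fits": [(121.174, -21.573), (87.252, 4.105)],
    "hdfs:///skymaps/map_002.fits": [(302.931, 44.890)],
}

with open("sources.txt", "w") as out:
    for map_path, sources in catalogs.items():
        for lon, lat in sources:
            out.write("%s\t%.6f\t%.6f\n" % (map_path, lon, lat))
```

Hadoop would then split sources.txt line by line across mappers, which is what made me think one record per source is the natural input, but I don't know if that is the recommended way.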