
I have a few tens of full-sky maps in binary (FITS) format, about 600 MB each.

For each sky map I already have a catalog of the positions of a few thousand sources, i.e. stars, galaxies, and radio sources.

For each source I would like to:

  • open the full sky map
  • extract the relevant section, typically 20 MB or less
  • run some statistics on it
  • aggregate the outputs into a catalog

I would like to run Hadoop, possibly using Python via the streaming interface, to process them in parallel.

I think the input to the mapper should be each record of the catalogs; the Python mapper can then open the full sky map, do the processing, and print the output to stdout.
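
To make this more concrete, here is a minimal sketch of such a streaming mapper. Everything beyond the question is an assumption: tab-separated input lines of the form (map path, latitude, longitude), astropy for FITS access, and a placeholder for the real coordinate-to-pixel conversion:

    #!/usr/bin/env python
    # Hadoop Streaming mapper sketch: reads one catalog record per stdin line.
    # Assumed line format (tab-separated): <path-to-fits-map> <lat> <lon>
    import sys

    from astropy.io import fits

    HALF_SIZE = 256  # half-width of the extracted section, in pixels (arbitrary)

    def main():
        for line in sys.stdin:
            path, lat, lon = line.rstrip("\n").split("\t")
            # memmap=True avoids pulling the whole ~600 MB map into memory
            with fits.open(path, memmap=True) as hdul:
                data = hdul[0].data
                # Placeholder: a real implementation would convert (lat, lon)
                # to pixel indices via the map's WCS, not a direct cast.
                y, x = int(float(lat)), int(float(lon))
                section = data[y - HALF_SIZE:y + HALF_SIZE,
                               x - HALF_SIZE:x + HALF_SIZE]
            # key = source position, value = the statistics for this source
            print("%s,%s\t%f\t%f" % (lat, lon, section.mean(), section.std()))

    if __name__ == "__main__":
        main()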

  1. Is this a reasonable approach?
  2. If so, I need to be able to configure Hadoop so that a full sky map is copied locally to the nodes that are processing one of its sources. How can I achieve that?
  3. Also, what is the best way to feed the input data to Hadoop? For each source I have a reference to the full sky map, plus its latitude and longitude.

1 Answer


While it doesn't sound like your few tens of sky maps amount to a very big data set, I have successfully used Hadoop as a simple way to write distributed applications/scripts.

For the problem you describe, I would try implementing a solution with Pydoop, and specifically Pydoop Script (full disclosure: I am one of the Pydoop developers).

You could set up a job that takes as input the list of sky map sections you want to process, serialized in some text format with one record per line. Each map task should process one of these records; you can achieve this split easily with the standard NLineInputFormat.
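
For example, the input file could be generated with a few lines of Python; the tab-separated layout and the example paths and coordinates below are assumptions made purely for illustration:

    # Sketch: serialize the catalogs into the job's input file, one source
    # per line (map path, latitude, longitude, tab-separated).
    catalogs = {  # hypothetical example data
        "hdfs:///maps/allsky_001.fits": [(12.3, 45.6), (78.9, 10.1)],
        "hdfs:///maps/allsky_002.fits": [(33.3, 44.4)],
    }
    with open("job_input.txt", "w") as out:
        for map_path, sources in catalogs.items():
            for lat, lon in sources:
                out.write("%s\t%.6f\t%.6f\n" % (map_path, lat, lon))

With NLineInputFormat, the job property mapreduce.input.lineinputformat.linespermap (mapred.line.input.format.linespermap in the old API) controls how many of these lines each map task receives.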

You don't need to copy the sky maps locally to all the nodes, as long as the map tasks can access the file system on which they are stored. Using the pydoop.hdfs module, the map function can read the section of the sky map it needs to process (given the coordinates it received as input) and then emit the statistics as you said, so that they can be aggregated in the reducer. pydoop.hdfs can read from both "standard" mounted file systems and HDFS.
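
Here is a sketch of what that map function could look like with Pydoop Script, assuming the same tab-separated input layout as above; the byte-offset computation and the statistic are placeholders for the real logic:

    # Pydoop Script sketch; the input layout and the "statistics" computed
    # here are assumptions, not the real application logic.
    import pydoop.hdfs as hdfs

    SECTION_SIZE = 20 * 1024 * 1024  # upper bound from the question: ~20 MB

    def mapper(key, value, writer):
        # value is one line of the job input: <map path> <lat> <lon>
        path, lat, lon = value.split("\t")
        # Placeholder: derive the byte range of the relevant section from
        # (lat, lon); a real implementation would use the FITS header/WCS.
        offset = 0
        with hdfs.open(path) as f:  # works for HDFS and mounted file systems
            f.seek(offset)
            section = f.read(SECTION_SIZE)
        stat = len(section)  # stand-in for the real per-source statistics
        writer.emit("%s,%s" % (lat, lon), str(stat))

    def reducer(key, ivalue, writer):
        # Aggregate the per-source outputs into the final catalog.
        for v in ivalue:
            writer.emit(key, v)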

Although the problem domain is completely unrelated, this application may serve as an example:

https://github.com/ilveroluca/seal/blob/master/seal/dist_bcl2qseq.py#L145

It uses the same strategy: it prepares a list of "coordinates" to process, serializes them to a file, and then launches a simple Pydoop job with that file as input.
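
Under the same assumptions, the whole pipeline could then be driven along these lines (the file and script names are illustrative):

    $ python make_input.py                        # writes job_input.txt as sketched above
    $ hadoop fs -put job_input.txt job_input.txt
    $ pydoop script stats.py job_input.txt stats_out
    $ hadoop fs -cat 'stats_out/part-*' > catalog_stats.txt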

Hope that helps!

answered 2013-07-18T10:52:15.517