
I have a FITS file of roughly 50 GB that contains multiple HDUs, all with the same format: a (1E5 x 1E6) array holding 1E5 objects and 1E6 timestamps. The HDUs describe different physical properties such as flux, RA, DEC, etc. I only want to read 5 objects out of each HDU (i.e. a (5 x 1E6) array).

Python 2.7, astropy 1.0.3, Linux x86_64

So far I have tried many of the suggestions I found, but nothing worked. My best approach is still:

import numpy as np
from astropy.io import fits

#the five objects I want to read out
obj_list = ['Star1','Star15','Star700','Star2000','Star5000']
dic = {}

with fits.open(fname, memmap=True, do_not_scale_image_data=True) as hdulist:

    # There is a special HDU 'OBJECTS' which is a (1E5 x 1) array and maps each object name to its row index in the FITS file.

    # First, get the indices of the rows that describe the objects in the fits file (not necessarily in order!)
    ind_objs = np.in1d(hdulist['OBJECTS'].data, obj_list, assume_unique=True).nonzero()[0] #indices of the candidates     

    # Second, read out the 5 object's time series
    dic['FLUX'] = hdulist['FLUX'].data[ind_objs] # (5 x 1E6) array
    dic['RA'] = hdulist['RA'].data[ind_objs] # (5 x 1E6) array
    dic['DEC'] = hdulist['DEC'].data[ind_objs] # (5 x 1E6) array

This code works well and fast for files up to ~20 GB, but runs out of memory for larger files (the larger files just contain more objects, not more timestamps). I don't understand why - astropy.io.fits uses mmap under the hood and, as far as I understand, should only load the (5 x 1E6) arrays into memory? So independent of the file size, what I want to read out is always the same size.
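
For reference, one way that should avoid touching the full HDU arrays at all is the HDU .section interface, which reads only the requested slice from the file. The sketch below reuses fname, the HDU names and obj_list from the code above; reading the rows one by one and stacking them is my own assumption, not something from the original post.

import numpy as np
from astropy.io import fits

obj_list = ['Star1','Star15','Star700','Star2000','Star5000']
dic = {}

with fits.open(fname, memmap=True, do_not_scale_image_data=True) as hdulist:

    # map object names to row indices, as in the code above
    ind_objs = np.in1d(hdulist['OBJECTS'].data, obj_list, assume_unique=True).nonzero()[0]

    # .section reads only the requested row from the file instead of
    # exposing the whole (1E5 x 1E6) array
    for name in ['FLUX', 'RA', 'DEC']:
        dic[name] = np.vstack([hdulist[name].section[i, :] for i in ind_objs])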

EDIT - here is the error message:

  dic['RA'] = hdulist['RA'].data[ind_objs] # (5 x 1E6) array
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/utils/decorators.py", line 341, in __get__
  val = self._fget(obj)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/hdu/image.py", line 239, in data
  data = self._get_scaled_image_data(self._data_offset, self.shape)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/hdu/image.py", line 585, in _get_scaled_image_data
  raw_data = self._get_raw_data(shape, code, offset)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/hdu/base.py", line 523, in _get_raw_data
  return self._file.readarray(offset=offset, dtype=code, shape=shape)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/file.py", line 248, in readarray
  shape=shape).view(np.ndarray)
File "/usr/local/python/lib/python2.7/site-packages/numpy/core/memmap.py", line 254, in __new__
  mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
mmap.error: [Errno 12] Cannot allocate memory

EDIT 2: Thanks, I have now incorporated the suggestions, which lets me handle FITS files up to 50 GB. The new code:

import numpy as np
from astropy.io import fits

#the five objects I want to read out
obj_list = ['Star1','Star15','Star700','Star2000','Star5000']
dic = {}

with fits.open(fname, mode='denywrite', memmap=True, do_not_scale_image_data=True) as hdulist:

    # There is a special HDU 'OBJECTS' which is a (1E5 x 1) array and maps each object name to its row index in the FITS file.

    # First, get the indices of the rows that describe the objects in the fits file (not necessarily in order!)
    ind_objs = np.in1d(hdulist['OBJECTS'].data, obj_list, assume_unique=True).nonzero()[0] #indices of the candidates     

    # Second, read out the 5 object's time series
    dic['FLUX'] = hdulist['FLUX'].data[ind_objs] # (5 x 1E6) array
    del hdulist['FLUX'].data
    dic['RA'] = hdulist['RA'].data[ind_objs] # (5 x 1E6) array
    del hdulist['RA'].data
    dic['DEC'] = hdulist['DEC'].data[ind_objs] # (5 x 1E6) array
    del hdulist['DEC'].data

mode='denywrite'

did not cause any change.

memmap=True 

is indeed not the default and needs to be set manually.

del hdulist['FLUX'].data 

etc. now allows me to read files of 50 GB instead of 20 GB.
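
As an aside, the del pattern above can also be written as a loop over the HDU names so that each memmapped array is released before the next HDU is read. This is just a restatement of the edited code, with the same fname, HDU names and obj_list assumed; it does not change what fits into memory.

import numpy as np
from astropy.io import fits

obj_list = ['Star1','Star15','Star700','Star2000','Star5000']
dic = {}

with fits.open(fname, mode='denywrite', memmap=True, do_not_scale_image_data=True) as hdulist:

    ind_objs = np.in1d(hdulist['OBJECTS'].data, obj_list, assume_unique=True).nonzero()[0]

    for name in ['FLUX', 'RA', 'DEC']:
        # slice out only the selected rows, then drop the reference to the
        # memmapped array so its mapping can be released before the next HDU
        dic[name] = hdulist[name].data[ind_objs]
        del hdulist[name].data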

New problem: anything larger than 50 GB still causes the same memory error - but now it happens right at the first line:

dic['FLUX'] = hdulist['FLUX'].data[ind_objs] # (5 x 1E6) array

1 Answer


It looks like you are running into this issue: https://github.com/astropy/astropy/issues/1380

The problem here is that even though it uses mmap, it opens the mmap in copy-on-write mode. That means your system needs to be able to allocate a virtual memory region large enough that it could, in principle, hold as much data as the whole mmap, in case you ever write data back through the mmap.
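
To make the distinction concrete, numpy exposes the two mapping modes directly: mode='c' requests a private, copy-on-write mapping, which is the kind that can fail with "Cannot allocate memory" when the system refuses to reserve that much writable virtual memory, while mode='r' is a plain read-only mapping. The toy example below only illustrates the two modes on a small temporary file; it is not meant to reproduce the failure.

import numpy as np
import tempfile

# write a small binary file to stand in for one FITS data unit
tmp = tempfile.NamedTemporaryFile(suffix='.dat', delete=False)
np.arange(1000, dtype='>f4').tofile(tmp.name)

# read-only mapping: no writable virtual memory has to be reserved
ro = np.memmap(tmp.name, dtype='>f4', mode='r', shape=(10, 100))

# copy-on-write mapping: private mapping whose writes never reach the file;
# for a 50+ GB file this is the mapping that can hit ENOMEM
cow = np.memmap(tmp.name, dtype='>f4', mode='c', shape=(10, 100))

print(ro[2, :5])
cow[2, :5] = 0   # allowed, but only modifies the in-memory copy
print(cow[2, :5])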

If you pass mode='denywrite' to fits.open(), it should work. Any attempt to modify the array will then result in an error, but that is fine if you only want to read the data.

If you still can't get it to work, you can also try the fitsio module, which has better support for reading files in small chunks.
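
If you do try it, a minimal sketch with fitsio could look like the following. It reads only the wanted rows of each HDU via fitsio's slice notation; fname, the HDU names and obj_list are carried over from the question, and the exact indexing behaviour should be checked against the fitsio documentation.

import numpy as np
import fitsio

obj_list = ['Star1','Star15','Star700','Star2000','Star5000']
dic = {}

with fitsio.FITS(fname) as f:

    # same lookup as in the question: which rows correspond to the wanted objects
    ind_objs = np.in1d(f['OBJECTS'].read(), obj_list, assume_unique=True).nonzero()[0]

    for name in ['FLUX', 'RA', 'DEC']:
        # each slice pulls only a (1 x 1E6) chunk from disk
        dic[name] = np.vstack([f[name][i:i+1, :] for i in ind_objs])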

answered 2016-03-03T08:30:27.780