
I have a FITS file of roughly 50 GB that contains multiple HDUs, all with the same format: a (1E5 x 1E6) array holding 1E5 objects and 1E6 timestamps. The HDUs describe different physical properties such as flux, RA, DEC, etc. I only want to read 5 objects out of each HDU (i.e. a (5 x 1E6) array).

Python 2.7, astropy 1.0.3, Linux x86_64

So far I have tried many of the suggestions I found, but nothing worked. My best approach is still:

import numpy as np
from astropy.io import fits

#the five objects I want to read out
obj_list = ['Star1','Star15','Star700','Star2000','Star5000']
dic = {}

with fits.open(fname, memmap=True, do_not_scale_image_data=True) as hdulist:

    # There is a special HDU 'OBJECTS' which is a (1E5 x 1) array and maps each object name to its row index in the FITS file.

    # First, get the indices of the rows that describe the objects in the fits file (not necessarily in order!)
    ind_objs = np.in1d(hdulist['OBJECTS'].data, obj_list, assume_unique=True).nonzero()[0] #indices of the candidates     

    # Second, read out the 5 object's time series
    dic['FLUX'] = hdulist['FLUX'].data[ind_objs] # (5 x 1E6) array
    dic['RA'] = hdulist['RA'].data[ind_objs] # (5 x 1E6) array
    dic['DEC'] = hdulist['DEC'].data[ind_objs] # (5 x 1E6) array

This code works well and fast for files up to ~20 GB, but runs out of memory for larger files (the larger files just contain more objects, not more timestamps). I don't understand why - astropy.io.fits uses mmap under the hood and, as far as I understand, should only load the (5 x 1E6) arrays into memory? So independent of the file size, what I want to read out is always the same size.
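
For reference, one way that should avoid touching the full HDU arrays at all is the HDU .section interface, which reads only the requested slice from the file. The sketch below reuses fname, the HDU names and obj_list from the code above; reading the rows one by one and stacking them is my own assumption, not something from the original post.

import numpy as np
from astropy.io import fits

obj_list = ['Star1','Star15','Star700','Star2000','Star5000']
dic = {}

with fits.open(fname, memmap=True, do_not_scale_image_data=True) as hdulist:

    # map object names to row indices, as in the code above
    ind_objs = np.in1d(hdulist['OBJECTS'].data, obj_list, assume_unique=True).nonzero()[0]

    # .section reads only the requested row from the file instead of
    # exposing the whole (1E5 x 1E6) array
    for name in ['FLUX', 'RA', 'DEC']:
        dic[name] = np.vstack([hdulist[name].section[i, :] for i in ind_objs])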

EDIT - here is the error message:

  dic['RA'] = hdulist['RA'].data[ind_objs] # (5 x 1E6) array
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/utils/decorators.py", line 341, in __get__
  val = self._fget(obj)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/hdu/image.py", line 239, in data
  data = self._get_scaled_image_data(self._data_offset, self.shape)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/hdu/image.py", line 585, in _get_scaled_image_data
  raw_data = self._get_raw_data(shape, code, offset)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/hdu/base.py", line 523, in _get_raw_data
  return self._file.readarray(offset=offset, dtype=code, shape=shape)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/file.py", line 248, in readarray
  shape=shape).view(np.ndarray)
File "/usr/local/python/lib/python2.7/site-packages/numpy/core/memmap.py", line 254, in __new__
  mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
mmap.error: [Errno 12] Cannot allocate memory

EDIT 2: Thanks, I have now incorporated the suggestions, which lets me handle FITS files up to 50 GB. The new code:

import numpy as np
from astropy.io import fits

#the five objects I want to read out
obj_list = ['Star1','Star15','Star700','Star2000','Star5000']
dic = {}

with fits.open(fname, mode='denywrite', memmap=True, do_not_scale_image_data=True) as hdulist:

    # There is a special HDU 'OBJECTS' which is a (1E5 x 1) array and maps each object name to its row index in the FITS file.

    # First, get the indices of the rows that describe the objects in the fits file (not necessarily in order!)
    ind_objs = np.in1d(hdulist['OBJECTS'].data, obj_list, assume_unique=True).nonzero()[0] #indices of the candidates     

    # Second, read out the 5 object's time series
    dic['FLUX'] = hdulist['FLUX'].data[ind_objs] # (5 x 1E6) array
    del hdulist['FLUX'].data
    dic['RA'] = hdulist['RA'].data[ind_objs] # (5 x 1E6) array
    del hdulist['RA'].data
    dic['DEC'] = hdulist['DEC'].data[ind_objs] # (5 x 1E6) array
    del hdulist['DEC'].data

mode='denywrite'

did not cause any change.

memmap=True 

is indeed not the default and needs to be set manually.

del hdulist['FLUX'].data 

etc. now allows me to read files of 50 GB instead of 20 GB.
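
As an aside, the del pattern above can also be written as a loop over the HDU names so that each memmapped array is released before the next HDU is read. This is just a restatement of the edited code, with the same fname, HDU names and obj_list assumed; it does not change what fits into memory.

import numpy as np
from astropy.io import fits

obj_list = ['Star1','Star15','Star700','Star2000','Star5000']
dic = {}

with fits.open(fname, mode='denywrite', memmap=True, do_not_scale_image_data=True) as hdulist:

    ind_objs = np.in1d(hdulist['OBJECTS'].data, obj_list, assume_unique=True).nonzero()[0]

    for name in ['FLUX', 'RA', 'DEC']:
        # slice out only the selected rows, then drop the reference to the
        # memmapped array so its mapping can be released before the next HDU
        dic[name] = hdulist[name].data[ind_objs]
        del hdulist[name].data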

New problem: anything larger than 50 GB still causes the same memory error - but now it happens right at the first line:

dic['FLUX'] = hdulist['FLUX'].data[ind_objs] # (5 x 1E6) array

1 Answer


It looks like you are running into this issue: https://github.com/astropy/astropy/issues/1380

The problem here is that even though it uses mmap, it opens the mmap in copy-on-write mode. That means your system needs to be able to allocate a virtual memory region large enough that it could, in principle, hold as much data as the whole mmap, in case you ever write data back through the mmap.
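
To make the distinction concrete, numpy exposes the two mapping modes directly: mode='c' requests a private, copy-on-write mapping, which is the kind that can fail with "Cannot allocate memory" when the system refuses to reserve that much writable virtual memory, while mode='r' is a plain read-only mapping. The toy example below only illustrates the two modes on a small temporary file; it is not meant to reproduce the failure.

import numpy as np
import tempfile

# write a small binary file to stand in for one FITS data unit
tmp = tempfile.NamedTemporaryFile(suffix='.dat', delete=False)
np.arange(1000, dtype='>f4').tofile(tmp.name)

# read-only mapping: no writable virtual memory has to be reserved
ro = np.memmap(tmp.name, dtype='>f4', mode='r', shape=(10, 100))

# copy-on-write mapping: private mapping whose writes never reach the file;
# for a 50+ GB file this is the mapping that can hit ENOMEM
cow = np.memmap(tmp.name, dtype='>f4', mode='c', shape=(10, 100))

print(ro[2, :5])
cow[2, :5] = 0   # allowed, but only modifies the in-memory copy
print(cow[2, :5])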

If you pass mode='denywrite' to fits.open(), it should work. Any attempt to modify the array will then result in an error, but that is fine if you only want to read the data.

If you still can't get it to work, you can also try the fitsio module, which has better support for reading files in small chunks.
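
If you do try it, a minimal sketch with fitsio could look like the following. It reads only the wanted rows of each HDU via fitsio's slice notation; fname, the HDU names and obj_list are carried over from the question, and the exact indexing behaviour should be checked against the fitsio documentation.

import numpy as np
import fitsio

obj_list = ['Star1','Star15','Star700','Star2000','Star5000']
dic = {}

with fitsio.FITS(fname) as f:

    # same lookup as in the question: which rows correspond to the wanted objects
    ind_objs = np.in1d(f['OBJECTS'].read(), obj_list, assume_unique=True).nonzero()[0]

    for name in ['FLUX', 'RA', 'DEC']:
        # each slice pulls only a (1 x 1E6) chunk from disk
        dic[name] = np.vstack([f[name][i:i+1, :] for i in ind_objs])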

answered 2016-03-03T08:30:27.780