I have a ~50 GB FITS file containing multiple HDUs, all with the same layout: a (1E5 x 1E6) array holding 1E5 objects and 1E6 timestamps. The HDUs describe different physical properties such as flux, RA, DEC, etc. I only want to read 5 objects from each HDU (i.e. a (5 x 1E6) array).
Python 2.7, astropy 1.0.3, Linux x86_64
So far I have tried many of the suggestions I found, but nothing worked. My best approach is still:
import numpy as np
from astropy.io import fits

# the five objects I want to read out
obj_list = ['Star1', 'Star15', 'Star700', 'Star2000', 'Star5000']
dic = {}
with fits.open(fname, memmap=True, do_not_scale_image_data=True) as hdulist:
    # A special HDU 'OBJECTS' (a (1E5 x 1) array) maps each row index in the FITS file to an object name.
    # First, get the indices of the rows that describe the requested objects (not necessarily in order!)
    ind_objs = np.in1d(hdulist['OBJECTS'].data, obj_list, assume_unique=True).nonzero()[0]  # indices of the candidates
    # Second, read out the 5 objects' time series
    dic['FLUX'] = hdulist['FLUX'].data[ind_objs]  # (5 x 1E6) array
    dic['RA'] = hdulist['RA'].data[ind_objs]      # (5 x 1E6) array
    dic['DEC'] = hdulist['DEC'].data[ind_objs]    # (5 x 1E6) array
This code works well and fast for files up to ~20 GB, but runs out of memory for larger files (the larger files simply contain more objects, not more timestamps). I don't understand why: astropy.io.fits uses mmap natively, and as I understand it, it should only ever load the (5 x 1E6) arrays into memory. So regardless of file size, what I want to read out is always the same size.
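To illustrate that expectation with a plain NumPy sketch (no FITS involved, the file here is a throwaway temp file): fancy indexing on a memory-mapped array copies only the selected rows into an ordinary in-memory array, but np.memmap must first map the *entire* file into the process's address space, and that mapping step is where mmap can fail.

```python
import tempfile

import numpy as np

# create and fill a small file-backed (1000 x 100) float32 array
rows, cols = 1000, 100
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()
mm = np.memmap(tmp.name, dtype=np.float32, mode='w+', shape=(rows, cols))
mm[:] = np.arange(rows * cols, dtype=np.float32).reshape(rows, cols)
mm.flush()

# fancy indexing reads and copies only the selected rows into RAM
picked = np.asarray(mm[[3, 17, 42]])
print(picked.shape)  # (3, 100)
```

The copy is tiny regardless of file size; what can still fail is the mmap.mmap call that has to reserve address space for the whole array.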
Edit - here is the error message:
dic['RA'] = hdulist['RA'].data[ind_objs] # (5 x 1E6) array
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/utils/decorators.py", line 341, in __get__
val = self._fget(obj)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/hdu/image.py", line 239, in data
data = self._get_scaled_image_data(self._data_offset, self.shape)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/hdu/image.py", line 585, in _get_scaled_image_data
raw_data = self._get_raw_data(shape, code, offset)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/hdu/base.py", line 523, in _get_raw_data
return self._file.readarray(offset=offset, dtype=code, shape=shape)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/file.py", line 248, in readarray
shape=shape).view(np.ndarray)
File "/usr/local/python/lib/python2.7/site-packages/numpy/core/memmap.py", line 254, in __new__
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
mmap.error: [Errno 12] Cannot allocate memory
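For reference, errno 12 (ENOMEM) from mmap on 64-bit Linux usually points at process or kernel limits rather than physical RAM. A few things worth checking (a diagnostic sketch; these are standard Linux knobs):

```shell
# Per-process virtual address-space limit; "unlimited" is the usual default.
ulimit -v
# Kernel overcommit policy; 2 (strict accounting) can refuse large mappings.
cat /proc/sys/vm/overcommit_memory
# Maximum number of distinct memory mappings a single process may hold.
cat /proc/sys/vm/max_map_count
```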
Edit 2: Thanks, I have now incorporated the suggestions, and they let me handle FITS files up to 50 GB. The new code:
import numpy as np
from astropy.io import fits

# the five objects I want to read out
obj_list = ['Star1', 'Star15', 'Star700', 'Star2000', 'Star5000']
dic = {}
with fits.open(fname, mode='denywrite', memmap=True, do_not_scale_image_data=True) as hdulist:
    # A special HDU 'OBJECTS' (a (1E5 x 1) array) maps each row index in the FITS file to an object name.
    # First, get the indices of the rows that describe the requested objects (not necessarily in order!)
    ind_objs = np.in1d(hdulist['OBJECTS'].data, obj_list, assume_unique=True).nonzero()[0]  # indices of the candidates
    # Second, read out the 5 objects' time series, releasing each mapping once its rows are copied
    dic['FLUX'] = hdulist['FLUX'].data[ind_objs]  # (5 x 1E6) array
    del hdulist['FLUX'].data
    dic['RA'] = hdulist['RA'].data[ind_objs]      # (5 x 1E6) array
    del hdulist['RA'].data
    dic['DEC'] = hdulist['DEC'].data[ind_objs]    # (5 x 1E6) array
    del hdulist['DEC'].data
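The read-then-copy-then-release pattern can also be wrapped in a small helper so each HDU's mapping is dropped before the next one is touched (a sketch; read_selected is a hypothetical name, and .copy() just makes the detachment from the mmap explicit):

```python
def read_selected(hdulist, hdu_names, ind_objs):
    """Read the selected rows from each named HDU, releasing each HDU's
    lazily-loaded .data as soon as its rows have been copied into RAM."""
    out = {}
    for name in hdu_names:
        # fancy indexing copies the rows; .copy() makes the copy explicit
        out[name] = hdulist[name].data[ind_objs].copy()
        # drop the cached .data attribute so the mapping can be freed
        del hdulist[name].data
    return out
```

Usage would then be dic = read_selected(hdulist, ['FLUX', 'RA', 'DEC'], ind_objs).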
The mode='denywrite' did not change anything. memmap=True is indeed not the default and needs to be set manually. The del hdulist['FLUX'].data etc. now lets me read 50 GB files instead of 20 GB ones.
New problem: anything larger than 50 GB still triggers the same memory error - but now immediately on the first line:
dic['FLUX'] = hdulist['FLUX'].data[ind_objs] # (5 x 1E6) array
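One way to avoid mapping each full HDU at all is the ImageHDU.section interface, which reads only the requested region directly from disk. A sketch (read_rows_via_section is a hypothetical helper; rows are fetched one at a time, since older astropy versions may not accept a fancy index list in .section):

```python
import numpy as np
from astropy.io import fits

def read_rows_via_section(fname, hdu_name, row_indices):
    """Read only the given rows of an image HDU; hdu.section fetches each
    requested slice directly from the file instead of mmapping the whole
    array, so peak memory stays proportional to the rows actually read."""
    with fits.open(fname, memmap=False, do_not_scale_image_data=True) as hdulist:
        hdu = hdulist[hdu_name]
        return np.vstack([hdu.section[i, :] for i in row_indices])
```

This keeps the address-space footprint small even for files much larger than 50 GB, at the cost of one read per selected row.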