I have an astronomical dataset of roughly a few million rows; this is what it looks like:
oid mjd mag magerr ra dec
0 1809105320673280.0 58338.42578125 20.6175079345703125 0.1499880552291870 12.0123176574707031 56.8318214416503906
1 1809105320673280.0 58365.42968750 20.7830238342285156 0.1610205173492432 12.0121049880981445 56.8318862915039062
2 1809105320673280.0 58377.37500000 20.7814407348632812 0.1609148979187012 12.0120792388916016 56.8319053649902344
3 1809105320673280.0 58389.36328125 20.6266822814941406 0.1505994796752930 12.0119419097900391 56.8318405151367188
4 1809105320673280.0 58430.28906250 20.7284736633300781 0.1573843955993652 12.0120868682861328 56.8317718505859375
... ... ... ... ... ... ...
8474460 381208110301184.0 58711.27343750 19.1085929870605469 0.0534130744636059 257.3913269042968750 -10.2478170394897461
8474461 381208110301184.0 58723.13671875 19.4006576538085938 0.0655696913599968 257.3913879394531250 -10.2481222152709961
8474462 381208110301184.0 58726.13281250 19.4201564788818359 0.0664852634072304 257.3913574218750000 -10.2475624084472656
8474463 381208110301184.0 58737.16796875 19.3793220520019531 0.0645836368203163 257.3914184570312500 -10.2481050491333008
8474464 381208110301184.0 58765.10937500 19.3963356018066406 0.0653686374425888 257.3912658691406250 -10.2478036880493164
I need to split it into separate per-source files. First, I group the data by observation ID (oid). Then I use the minimum ra & dec of each group to compute the angular distance between different groups:
PyAstronomy.pyasl.getAngDist(ra1, dec1, ra2, dec2)
where ra1 and dec1 belong to one group and ra2 and dec2 belong to another. If the angular distance is smaller than some threshold, the code writes both groups into the same file.
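For reference, pyasl.getAngDist takes and returns degrees; the same quantity can be computed directly with the haversine formula, which is useful later for vectorizing. This is a minimal sketch (the function name ang_dist_deg is mine, not part of PyAstronomy):

```python
import numpy as np

def ang_dist_deg(ra1, dec1, ra2, dec2):
    """Angular distance in degrees between two sky positions given in
    degrees, via the haversine formula (same quantity as pyasl.getAngDist)."""
    ra1, dec1, ra2, dec2 = map(np.radians, (ra1, dec1, ra2, dec2))
    a = (np.sin((dec2 - dec1) / 2) ** 2
         + np.cos(dec1) * np.cos(dec2) * np.sin((ra2 - ra1) / 2) ** 2)
    return np.degrees(2 * np.arcsin(np.sqrt(a)))

# Two points 1 arcsec apart in declination -> ~1/3600 deg
print(ang_dist_deg(12.0, 56.83, 12.0, 56.83 + 1 / 3600))
```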
The code is:
#!/usr/bin/env python3
import numpy as np
import pandas as pd
import glob
from PyAstronomy import pyasl

def data():
    cols = ['oid', 'mjd', 'mag', 'magerr', 'ra', 'dec']
    threshold = 1.5 / 3600
    df = pd.read_hdf('ztf_dr3.txt', dtype={'8': np.float32, '9': np.float32})
    pd.set_option('display.precision', 16)
    # data = data.apply(pd.to_numeric, errors='coerce')
    df.columns = cols
    edf = pd.DataFrame(columns=cols)
    grouped = df.groupby(['oid'])
    for name, i in grouped:
        edf = edf.append(i, ignore_index=True)
        for name, j in grouped:
            ang_dist = pyasl.getAngDist(i['ra'].min(), i['dec'].min(),
                                        j['ra'].min(), j['dec'].min())
            if ang_dist <= threshold:
                edf = edf.append(j, ignore_index=True)
        edf.to_csv('result/' + str(i['oid'].min()) + '.txt',
                   columns=cols, header=True, index=False)
        edf = pd.DataFrame(columns=cols)
It works correctly, but it is very slow. I tried rewriting it as a comprehension:
def data():
    pd.set_option('display.precision', 16)
    grouped = pd.read_hdf('ztf_dr3.txt',
                          dtype={'8': np.float32, '9': np.float32}).groupby(['0'])
    edf = pd.DataFrame([[i, j]
                        for name, i in grouped
                        for name, j in grouped
                        if pyasl.getAngDist(i['8'].min(), i['9'].min(),
                                            j['8'].min(), j['9'].min()) <= 1.5 / 3600.0])
    return edf.to_csv('result/{}.txt'.format("edf['0'].min()"))
The problem is that the nested list comprehension uses a lot of memory (about 3x). I would appreciate any ideas on how to write this code as a nested comprehension.
Sample data file:
https://drive.google.com/file/d/1naJFXLJOjsQ2nVnGFX5WWH-hBdxikGq3/view?usp=sharing
Edit: it does not have to be in nested-list-comprehension form; I just want it to be faster.
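For context on the speed question: most of the cost comes from calling getAngDist once per pair of groups in a Python double loop. One alternative is to compute all pairwise distances between the per-group representative coordinates in a single vectorized step. This is a sketch under my own assumptions (cleaned column names oid/ra/dec, group-minimum coordinates as in the original code; match_groups is a hypothetical helper name):

```python
import numpy as np
import pandas as pd

def match_groups(df, threshold_deg=1.5 / 3600):
    """For each oid, find all oids whose representative coordinate lies
    within threshold_deg, using one vectorized haversine distance matrix
    instead of a Python double loop over groups."""
    reps = df.groupby('oid')[['ra', 'dec']].min()   # one row per oid
    ra = np.radians(reps['ra'].to_numpy())
    dec = np.radians(reps['dec'].to_numpy())
    # pairwise haversine, shape (n_groups, n_groups)
    a = (np.sin((dec[:, None] - dec[None, :]) / 2) ** 2
         + np.cos(dec[:, None]) * np.cos(dec[None, :])
         * np.sin((ra[:, None] - ra[None, :]) / 2) ** 2)
    dist = np.degrees(2 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0))))
    oids = reps.index.to_numpy()
    return {oid: oids[dist[k] <= threshold_deg] for k, oid in enumerate(oids)}
```

Writing the files then becomes one isin-filter per group, e.g. df[df['oid'].isin(members)].to_csv(f'result/{oid}.txt', index=False). Note the distance matrix is n_groups x n_groups; if the number of distinct oids is very large, a spatial index such as scipy.spatial.cKDTree (or astropy's match_to_catalog_sky) would avoid materializing it.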