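I am using the Pandas library for remote sensing time series analysis. Eventually I would like to save my DataFrame to csv using a chunk size, but I have run into a small problem. My code generates 6 NumPy arrays, which I convert to Pandas Series. Each of these Series contains a lot of items: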

>>> prcpSeries.shape
(12626172,)

I would like to add the Series to a Pandas DataFrame (df) so that I can save them chunk by chunk to a csv file.

d = {'prcp': pd.Series(prcpSeries),
     'tmax': pd.Series(tmaxSeries),
     'tmin': pd.Series(tminSeries),
     'ndvi': pd.Series(ndviSeries),
     'lstm': pd.Series(lstmSeries),
     'evtm': pd.Series(evtmSeries)}

df = pd.DataFrame(d)
outFile ='F:/data/output/run1/_'+str(i)+'.out'
df.to_csv(outFile, header = False, chunksize = 1000)
d = None
df = None

But my code gets stuck on the following line, which raises a MemoryError:

df = pd.DataFrame(d)

Any suggestions? Is it possible to fill a Pandas DataFrame chunk by chunk?

1 Answer

If you know that all of these are the same length, you can create the DataFrame directly from one array and then add each remaining column:

df = pd.DataFrame(prcpSeries, columns=['prcp'])
df['tmax'] = tmaxSeries
...

Note: you can also use the to_frame method, which optionally takes a name (useful if the Series doesn't have one):

df = prcpSeries.to_frame(name='prcp')
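
Put together, the column-by-column approach might look like the sketch below. The tiny arrays `prcp` and `tmax` are hypothetical stand-ins for the large arrays in the question, and an in-memory buffer is used in place of the real output file so the snippet runs anywhere.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical small stand-ins for the large arrays in the question.
n = 10
prcp = np.arange(n, dtype=np.float32)
tmax = np.arange(n, dtype=np.float32) + 100

# Start from a single column, then attach the others one at a time
# instead of building the whole frame from a dict of Series.
df = pd.Series(prcp).to_frame(name='prcp')
df['tmax'] = tmax

# Write the rows in batches; the buffer stands in for the output path.
buf = io.StringIO()
df.to_csv(buf, header=False, chunksize=4)
csv_text = buf.getvalue()
```

Whether this avoids the MemoryError depends on how much memory is free, since the finished frame still holds every column at once.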

However, if they are variable length then this will lose some data (any arrays which are longer than prcpSeries). An alternative here is to create each as a DataFrame and then perform an outer join (using concat):

df1 = pd.DataFrame(prcpSeries, columns=['prcp'])
df2 = pd.DataFrame(tmaxSeries, columns=['tmax'])
...

df = pd.concat([df1, df2, ...], join='outer', axis=1)

For example:

In [21]: dfA = pd.DataFrame([1,2], columns=['A'])

In [22]: dfB = pd.DataFrame([1], columns=['B'])

In [23]: pd.concat([dfA, dfB], join='outer', axis=1)
Out[23]:
   A   B
0  1   1
1  2 NaN
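
On the question of filling the output chunk by chunk: a different option from the ones above, sketched here under the assumption that all arrays have the same length, is to never build the full DataFrame and instead write one small slice at a time to an open file handle. The helper name `write_in_chunks` and the sample arrays are hypothetical, introduced only for illustration.

```python
import io

import numpy as np
import pandas as pd


def write_in_chunks(arrays, out, chunk_rows=1000):
    """Write equal-length 1-D arrays to `out` as CSV, chunk_rows rows at a time.

    `arrays` maps column name -> array; `out` is an open text file or buffer.
    """
    names = list(arrays)
    n = len(arrays[names[0]])
    for start in range(0, n, chunk_rows):
        stop = min(start + chunk_rows, n)
        # Only this slice is turned into a DataFrame.
        chunk = pd.DataFrame(
            {name: arrays[name][start:stop] for name in names},
            index=range(start, stop),  # keep the global row numbers
        )
        chunk.to_csv(out, header=False)


# Hypothetical sample data; a buffer stands in for the output file.
arrays = {'prcp': np.arange(5), 'tmax': np.arange(5) * 2}
buf = io.StringIO()
write_in_chunks(arrays, buf, chunk_rows=2)
lines = buf.getvalue().splitlines()
```

With a real file, the same helper can be called inside `with open(path, 'w', newline='') as f:` so that each chunk is appended to the same handle.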
answered 2013-06-18T13:11:22.323