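I have an m4.4xlarge (64 GB RAM) EC2 box. I'm running dask with pandas and I get the following memory error.
I get this after about 24 hours of running, which is roughly how long the task should take to complete. So I'm not sure whether the error is due to insufficient RAM, insufficient disk space (the script ends with DF.to_csv(), which writes the large DataFrame to disk), or internal pandas/numpy memory limits.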
    raise(remote_exception(res, tb))
dask.async.MemoryError:

Traceback
---------
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/dask/async.py", line 267, in execute_task
    result = _execute_task(task, data)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/dask/async.py", line 248, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/dask/async.py", line 249, in _execute_task
    return func(*args2)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 4061, in apply
    return self._apply_standard(f, axis, reduce=reduce)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 4179, in _apply_standard
    result = result._convert(datetime=True, timedelta=True, copy=False)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 3004, in _convert
    copy=copy)).__finalize__(self)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 2941, in convert
    return self.apply('convert', **kwargs)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 2901, in apply
    bm._consolidate_inplace()
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3278, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 4269, in _consolidate
    _can_consolidate=_can_consolidate)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 4289, in _merge_blocks
    new_values = _vstack([b.values for b in blocks], dtype)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 4335, in _vstack
    return np.vstack(to_stack)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/numpy/core/shape_base.py", line 230, in vstack
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
Update:
Following MRocklin's answer, here is some additional information.
This is how I run the process:
    # `get` is the dask scheduler; it is passed in as the last positional argument below
    def dask_stats_calc(dfpath,v1,v2,v3...,get):
        dfpath_ddf = dd.from_pandas(dfpath,npartitions=16,sort=False)
        return dfpath_ddf.apply(calculate_stats,axis=1,args=(dfdaily,v1,v2,v3...)).compute(get=get).stack().reset_index(drop=True)

    f_threaded = partial(dask_stats_calc,dfpath,v1,v2,v3...,multiprocessing.get)
    f_threaded()
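The issue is that dfpath is a DataFrame with 1.4 million rows, so dfpath_ddf.apply() runs over all 1.4 million rows. df.to_csv() only happens once the whole dfpath_ddf.apply() has finished, but as you said, it would be better to write to disk periodically.
So the question now is: how do I implement something like writing to disk every 200k rows? I suppose I could split dfpath_ddf into 200k-row chunks (or something similar) and run each chunk in turn?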
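
To make the idea concrete, here is a rough sketch of the chunked approach I have in mind. It uses plain pandas so that it runs on its own. The names apply_in_chunks and calculate_stats_stub are made up for illustration (the stub stands in for my real calculate_stats), and I have not confirmed that this actually avoids the MemoryError.

```python
import pandas as pd


def calculate_stats_stub(row):
    # Hypothetical stand-in for calculate_stats: returns a small Series per row.
    return pd.Series({"total": row["a"] + row["b"],
                      "product": row["a"] * row["b"]})


def apply_in_chunks(df, row_func, out_path, chunk_size=200000):
    """Apply row_func to df in slices of chunk_size rows, appending each
    slice's result to out_path instead of holding everything in memory."""
    rows_written = 0
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        # Assumption: in the dask version, the dd.from_pandas(chunk, ...)
        # .apply(...).compute(...) call would replace this plain pandas apply.
        result = chunk.apply(row_func, axis=1)
        first = (start == 0)
        # Overwrite the file and write the header for the first chunk only,
        # then append the remaining chunks without a header.
        result.to_csv(out_path, mode="w" if first else "a",
                      header=first, index=False)
        rows_written += len(result)
        del result  # drop the reference before processing the next slice
    return rows_written
```

Calling apply_in_chunks(dfpath, some_row_function, "stats.csv") would then only ever keep one 200k-row result in memory at a time, at the cost of losing the single final DataFrame; the output would have to be read back from the CSV if it is needed later.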