python - RAM issue with DASK and its from_pandas function

Question

i'm trying to use DASK package in Python 3.4 for avoid RAM problems with large datasets, but i've notice a problem.

Using native fucntion "read_csv" i load big dataset into a dask dataframe using less than 150MB of RAM.

The same dataset read with PANDAS DB Connection (using limit and offset options) and dask fuction"from_pandas" fill my ram uo to 500/750 MB.

I can't undestand why this happens and i want to fix this issue.

Here the code:

def read_sql(schema,tab,cond):

sql_count="""Select count(*) from """+schema+"""."""+tab
if (len(cond)>0):
    sql_count+=""" where """+cond

a=pd.read_sql_query(sql_count,conn)
num_record=a['count'][0]

volte=num_record//10000
print(num_record)

if(num_record%10000>0):
    volte=volte+1

sql_base="""Select * from """+schema+"""."""+tab
if (len(cond)>0):
    sql_base+=""" where """+cond
sql_base+=""" limit 10000"""

base=pd.read_sql_query(sql_base,conn)

dataDask=dd.from_pandas(base, npartitions=None, chunksize=1000000)

for i in range(1,volte):
    if(i%100==0):
        print(i)
    sql_query="""Select * from """+schema+"""."""+tab
    if (len(cond)>0):
        sql_query+=""" where """+cond
    sql_query+=""" limit 10000 offset """+str(i*10000)

    a=pd.read_sql_query(sql_query,conn)

    b=dd.from_pandas(a , npartitions=None, chunksize=1000000)

    divisions = list(b.divisions)
    b.divisions = (None,)*len(divisions)
    dataDask=dataDask.append(b)

return dataDask



a=read_sql('schema','tabella','data>\'2016-06-20\'')

Thanks for help me

Waiting for news

score 4 · Accepted Answer

一个 dask.dataframe 由许多 pandas 数据帧组成，或者像read_csv计划按需计算这些数据帧这样的功能。它通过执行该计划以延迟计算数据帧来实现低内存执行。

使用数据帧时from_pandas，数据帧已经在内存中，因此 dask.dataframe 几乎无法避免内存爆炸。

在这种情况下，我看到三个解决方案：

构建一个dask.dataframe.read_sql函数来懒惰地从数据库中提取数据块。在撰写本文时，这里正在进行中：https ://github.com/dask/dask/pull/1181
用于dask.delayed在用户代码中实现相同的结果。请参阅http://dask.pydata.org/en/latest/delayed.html和http://dask.pydata.org/en/latest/delayed-collections.html（这是我的主要建议）
将您的数据库转储到类似 HDF 文件的文件中，其中已经有一个方便的 dask.dataframe 函数。

python - RAM issue with DASK and its from_pandas function

1 回答 1

Related

Reference