
I would like a relatively large Pandas DataFrame backed by a memmapped ndarray (from shared memory). I have working code (below), but when I run a computation against the DataFrame, overall system memory usage (as measured by top) climbs, as if the process were copying the data. Once the computation finishes, system memory usage returns to baseline. If I run the same computation directly on the memmap, system memory usage does not climb. Is there a way to avoid this (apparent) temporary spike in memory usage?

(Note that in both cases the memory percentage top reports for the single python process climbs, FWIW.)
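As a scriptable alternative to eyeballing top, the process's own peak memory can be sampled before and after a computation. A minimal sketch using the stdlib `resource` module (Unix-only; note `ru_maxrss` is reported in kilobytes on Linux but bytes on macOS):

```python
import resource  # Unix-only stdlib module


def peak_rss():
    # Peak resident set size of this process so far
    # (kilobytes on Linux, bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


before = peak_rss()
junk = [0.0] * (10 ** 6)  # allocate some memory to move the peak
after = peak_rss()
print(before, after)
```

This only tracks the high-water mark per process; it will not show the spike dropping back to baseline the way top does, but it confirms whether a given operation grew this process's footprint.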

Using pandas 0.20.3, numpy 1.13.1, python 2.7.11

Code: the first script, example_setup.py, sets up the shared-memory memmap:

import numpy

N = 7300000000  #this large N makes it really obvious on top what is happening
memmap_file = "/tmp/hello_world.bin"

progress_mod = 10000000
print N/progress_mod

if __name__ == "__main__":
    print "opening memmap_file:  {}".format(memmap_file)
    my_mm = numpy.memmap(memmap_file, dtype="float32", mode="w+", shape=(N,))

    print "writing to memmap file integers - N:  {}".format(N)
    for i in xrange(N):
        my_mm[i] = float(i)

        if (i%progress_mod) == 0:
            print "progress i:  {}".format(i)

    raw_input("pause here to allow other processes to use shared memory")

The second script, example_use.py, uses the memmap above both directly and as the backing store for a Pandas DataFrame:

import example_setup
import numpy
import pandas

if __name__ == "__main__":
    memmap_file = example_setup.memmap_file
    N = example_setup.N

    print "opening memmap_file:  {}".format(memmap_file)
    my_mm = numpy.memmap(memmap_file, dtype="float32", mode="r", shape=(N,))

    print "calculate mean of my_mm, monitor memory using top.  This process will show increase usage, but system usage will not increase"
    my_mean = my_mm.mean()
    print "my_mean:  {}".format(my_mean)
    raw_input("pause here before doing the above with a dataframe backed by my_mm")

    df = pandas.DataFrame(my_mm, copy=False)
    print """calculate mean of pandas DataFrame df, monitor memory using top.  Both this process and the system usage will increase.  
        When the calculation finishes, system usage will return to baseline"""
    my_df_mean = df.mean()
    print "my_df_mean:  {}".format(my_df_mean)
    raw_input("pause here before exiting")
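One way to check whether the DataFrame constructor itself copied the buffer (as opposed to the spike coming from temporaries inside the reduction) is `numpy.shares_memory`. A small in-memory sketch, with a plain array standing in for the memmap:

```python
import numpy
import pandas

# Sketch: test whether DataFrame(arr, copy=False) reuses the array's
# buffer.  A small in-memory array stands in for the memmap here.
arr = numpy.arange(10, dtype="float32")
df = pandas.DataFrame(arr, copy=False)

# True means the constructor did not copy; any spike during df.mean()
# would then come from temporaries allocated inside the computation.
shares = numpy.shares_memory(arr, df[0].values)
print(shares)
```

Running the same check against the real memmap-backed DataFrame would distinguish a construction-time copy from computation-time temporaries.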