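I'm trying to allocate a large memory-mapped 2D array that is bigger than RAM, and it keeps failing with an out-of-memory error. I'm using Java 8, linux-amd64 and nd4j 1.0.0-beta4. Based on the documentation (https://deeplearning4j.org/docs/latest/deeplearning4j-config-memory), my understanding is that I should be able to allocate an array much larger than RAM, because it will be backed by a temporary file and then rely on the operating system to page it in and out as needed (e.g. using mmap).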

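Update: after a reboot I got some intermittent success. I wonder whether the large-array allocation routine needs some baseline amount of RAM? Maybe for zeroing? I'll report back...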

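I've tried a few different policy options, made sure the disk holding the temporary file has enough free space, and done some debugging to look inside the memory allocation code, but to no avail. It always seems to fail, complaining that there isn't enough physical memory. That is correct: there isn't enough memory to do this, and that is the whole point.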

long cols = 3000;
long rows = 1000000;

// 4 bytes per float element
long expectedSize = 4 * cols * rows;
this.nd4jWorkspaceManager = Nd4j.getWorkspaceManager();
// Workspace backed by a memory-mapped temp file rather than plain RAM
this.workspaceConfig = WorkspaceConfiguration.builder()
    .initialSize(expectedSize)
    .policyLocation(LocationPolicy.MMAP)
    .policyAllocation(AllocationPolicy.OVERALLOCATE)
    .policySpill(SpillPolicy.EXTERNAL)
    .tempFilePath(System.getProperty("user.home") + "/.nd4jtmp")
    .build();

System.out.format("Attempting to create workspace of size %s%n", formatBytes(expectedSize));
this.memoryWorkspace = this.nd4jWorkspaceManager.getAndActivateWorkspace(workspaceConfig, "M2");
System.out.println("... Done");

// The array is created while workspace "M2" is active
System.out.format("Attempting to create array of size %s%n", formatBytes(expectedSize));
INDArray matrix = Nd4j.create(DataType.FLOAT, rows, cols);
System.out.println("... Done");

System.out.format("Populating array with random numbers...%n");

for (int i = 0; i < rows; i++) {
  for (int j = 0; j < cols; j++) {
    matrix.put(i, j, (float) Math.random());
  }
}

System.out.println("... Done");
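
The snippet above calls formatBytes, which is not shown in the question. Below is a minimal sketch of what such a helper might look like. The class name SizeFormat, the unit labels and the rounding are assumptions, chosen so that the result matches the 11.18gb printed in the program output. It also works through the size arithmetic: 4 bytes per float x 3000 columns x 1,000,000 rows = 12,000,000,000 bytes.

```java
import java.util.Locale;

public class SizeFormat {
    // Hypothetical unit labels; the real helper in the question is not shown.
    private static final String[] UNITS = {"b", "kb", "mb", "gb", "tb"};

    // Formats a byte count using binary (1024-based) units and two decimals.
    static String formatBytes(long bytes) {
        double value = bytes;
        int unit = 0;
        while (value >= 1024 && unit < UNITS.length - 1) {
            value /= 1024;
            unit++;
        }
        // Locale.ROOT keeps the decimal separator a '.' on every machine.
        return String.format(Locale.ROOT, "%.2f%s", value, UNITS[unit]);
    }

    public static void main(String[] args) {
        long cols = 3000;
        long rows = 1_000_000;
        long expectedSize = 4 * cols * rows; // 4 bytes per float
        System.out.println(expectedSize + " bytes = " + formatBytes(expectedSize));
    }
}
```

With roughly 7.5 GiB of physical RAM reported by free, an array of this size cannot fit in memory, which is why the memory-mapped workspace is needed in the first place.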

Here is the output of free:

$ free
              total        used        free      shared  buff/cache   available
Mem:        7852420     2950656      120860      311200     4780904     4067768
Swap:       7811068      190632     7620436

I run the main method, and it fails to allocate the array:

09:14:36,476 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback-test.xml]
09:14:36,477 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback.groovy]
09:14:36,477 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback.xml] at [file:/home/nickg/src/riskscape/riskscape/cli/bin/main/logback.xml]
09:14:36,478 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs multiple times on the classpath.
09:14:36,478 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [file:/home/nickg/src/riskscape/riskscape/test-shared/bin/main/logback.xml]
09:14:36,478 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [file:/home/nickg/src/riskscape/riskscape/cli/bin/main/logback.xml]
09:14:36,786 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
09:14:36,790 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
09:14:36,797 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDERR]
09:14:36,909 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [nz.org.riskscape] to WARN
09:14:36,909 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDERR] to Logger[nz.org.riskscape]
09:14:36,909 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to ERROR
09:14:36,910 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDERR] to Logger[ROOT]
09:14:36,910 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
09:14:36,912 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@6950e31 - Registering current configuration as safe fallback point

Attempting to create workspace of size 11.18gb
... Done
Attempting to create array of size 11.18gb
Exception in thread "main" java.lang.OutOfMemoryError: Cannot allocate new LongPointer(8): totalBytes = 1, physicalBytes = 3779M
    at org.bytedeco.javacpp.LongPointer.<init>(LongPointer.java:76)
    at org.bytedeco.javacpp.LongPointer.<init>(LongPointer.java:41)
    at org.nd4j.linalg.api.buffer.BaseDataBuffer.<init>(BaseDataBuffer.java:407)
    at org.nd4j.linalg.api.buffer.LongBuffer.<init>(LongBuffer.java:81)
    at org.nd4j.linalg.api.buffer.factory.DefaultDataBufferFactory.createLong(DefaultDataBufferFactory.java:478)
    at org.nd4j.linalg.api.buffer.factory.DefaultDataBufferFactory.createLong(DefaultDataBufferFactory.java:473)
    at org.nd4j.linalg.factory.Nd4j.createBufferDetached(Nd4j.java:1449)
    at org.nd4j.linalg.api.shape.Shape.createShapeInformation(Shape.java:3241)
    at org.nd4j.linalg.api.ndarray.BaseShapeInfoProvider.createShapeInformation(BaseShapeInfoProvider.java:76)
    at org.nd4j.linalg.cpu.nativecpu.DirectShapeInfoProvider.createShapeInformation(DirectShapeInfoProvider.java:65)
    at org.nd4j.linalg.cpu.nativecpu.DirectShapeInfoProvider.createShapeInformation(DirectShapeInfoProvider.java:49)
    at org.nd4j.linalg.api.ndarray.BaseNDArray.<init>(BaseNDArray.java:232)
    at org.nd4j.linalg.api.ndarray.BaseNDArray.<init>(BaseNDArray.java:343)
    at org.nd4j.linalg.cpu.nativecpu.NDArray.<init>(NDArray.java:185)
    at org.nd4j.linalg.cpu.nativecpu.CpuNDArrayFactory.create(CpuNDArrayFactory.java:189)
    at org.nd4j.linalg.factory.Nd4j.create(Nd4j.java:4651)
    at org.nd4j.linalg.factory.Nd4j.create(Nd4j.java:4129)
    at NDArrayAllocationTest.run(NDArrayAllocationTest.java:40)
    at NDArrayAllocationTest.main(NDArrayAllocationTest.java:14)
Caused by: java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes (3779M) > maxPhysicalBytes (3410M)
    at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:585)
    at org.bytedeco.javacpp.Pointer.init(Pointer.java:125)
    at org.bytedeco.javacpp.LongPointer.allocateArray(Native Method)
    at org.bytedeco.javacpp.LongPointer.<init>(LongPointer.java:68)
    ... 18 more

1 Answer

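It looks like this was just a housekeeping issue. I confirmed that the mmap'd file was being used for my NDArray, and that the OOM happened while allocating a small buffer for the array's shape information. After setting org.bytedeco.javacpp.maxphysicalbytes to a large enough value, the NDArray was constructed successfully.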
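
For reference, a minimal sketch of one way to set that property from code. It assumes the property has to be in place before any JavaCPP or ND4J class is loaded, since I believe JavaCPP reads it once at class initialization. The class name JavaCppMemoryLimit and the 16 GiB value are purely illustrative. Passing -Dorg.bytedeco.javacpp.maxphysicalbytes=<bytes> on the java command line should have the same effect and avoids the ordering concern.

```java
public class JavaCppMemoryLimit {
    // Property name as given in the answer above.
    static final String MAX_PHYSICAL_BYTES = "org.bytedeco.javacpp.maxphysicalbytes";

    // Sets the limit (in bytes) and returns the value now visible to the JVM.
    static String raiseLimit(long bytes) {
        System.setProperty(MAX_PHYSICAL_BYTES, Long.toString(bytes));
        return System.getProperty(MAX_PHYSICAL_BYTES);
    }

    public static void main(String[] args) {
        // 16 GiB is an arbitrary illustrative value, larger than the 11.18gb array.
        long limit = 16L * 1024 * 1024 * 1024;
        System.out.println(MAX_PHYSICAL_BYTES + "=" + raiseLimit(limit));
        // Assumption: Nd4j / JavaCPP classes are only touched after this point,
        // e.g. the workspace and array creation from the question would go here.
    }
}
```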

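I'm not really sure why this works or why it is necessary, but there we go. The long buffer it failed to allocate was only about 8 longs in length... Maybe the mmap'd file is inflating the memory size reported for the process?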

If anyone who knows more about ND4J's memory management can comment, please do.

Answered 2019-08-06T11:11:29.167