
I am developing an R package called biglasso that fits lasso models in R for massive data sets by using the memory-mapping techniques implemented in the bigmemory C++ library. Specifically, for a very large dataset (say 10 GB), a file-backed big.matrix is first created, with the memory-mapped files stored on disk. The model-fitting algorithm then accesses the big.matrix via the MatrixAccessor defined in the C++ library to obtain the data for computation. My assumption is that the memory-mapping technique allows one to work on data larger than the available RAM, as mentioned in the bigmemory paper.
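For reference, here is a minimal sketch of the kind of setup described above, using the standard bigmemory API. The file names and dimensions are made up for illustration and are not the ones biglasso uses internally.

    # Sketch only: hypothetical file names ("X.bin", "X.desc") and dimensions.
    library(bigmemory)

    # Create a file-backed big.matrix: the data live in a memory-mapped
    # binary file on disk instead of in RAM.
    X <- filebacked.big.matrix(nrow = 1e6, ncol = 2000, type = "double",
                               backingfile = "X.bin",
                               descriptorfile = "X.desc",
                               backingpath = ".")

    # Any later R session (or C++ code, via the descriptor) can re-attach
    # to the same backing file without reloading the data.
    X <- attach.big.matrix("X.desc")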

For my package, everything works great at this point as long as the data size doesn't exceed the available RAM. However, when the data is larger than RAM, the code seemingly runs forever: no complaints, no errors, no stop. On Mac, I checked the top command and noticed that the status of the job kept switching between "sleeping" and "running"; I am not sure what this means or whether it indicates something is wrong.

[EDIT:]

By "cannot finish", "run forever", I mean that: working on 18 GB of data with 16 GB RAM cannot finish for over 1.5 hours, but it could be done within 5 minutes if with 32 GB RAM.

[END EDIT]

Questions:

(1) I basically understand that memory-mapping utilizes virtual memory so that it can handle data larger than RAM. But how much memory does it need to deal with larger-than-RAM objects? Is there an upper bound? Or is it determined by the size of the virtual memory? Since virtual memory is effectively unlimited (constrained only by the hard drive), does that mean the memory-mapping approach can handle data much, much larger than physical RAM?

(2) Is there a way to measure the memory used in physical RAM and the virtual memory used, separately?
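For example, on macOS or Linux, is something along these lines the right way to look at it? (This is just a guess on my part; it assumes the standard ps utility, and the column names may vary slightly between systems.)

    # Resident vs. virtual memory of the current R process (macOS/Linux).
    # rss = resident set size: pages currently held in physical RAM (KB)
    # vsz = virtual size: total address space mapped by the process (KB)
    pid <- Sys.getpid()
    system(sprintf("ps -o pid,rss,vsz -p %d", pid))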

(3) Is there anything I am doing wrong? What are the possible reasons for my problem here?

Really appreciate any feedback! Thanks in advance.


Below are some details of my experiments on Mac and Windows and related questions.

  1. On Mac OS: physical RAM: 16 GB; testing data: 18 GB. Here is a screenshot of the memory usage. The code does not finish.

[Screenshot: Activity Monitor memory usage]

[EDIT 2]

[Screenshot: CPU usage and CPU history]

I attached the CPU usage and history above. Only a single core is used for the R computation. It's strange that System uses 6% CPU while User uses only 3%, and the CPU history window shows a lot of red area.

Question: What does this suggest? I now suspect that the CPU cache is being filled up. Is that right? If so, how could I resolve this issue?

[END EDIT 2]

Questions:

(4) As I understand it, the "Memory" column shows the memory used in physical RAM, while the "Real Memory" column shows the total memory usage, as pointed out here. Is that correct? The memory used always shows ~2 GB, so I don't understand why so much of the RAM is left unused.

(5) A minor question: from what I observed, it seems that "Memory Used" + "Cache" is always less than "Physical Memory" (in the bottom middle part). Is this correct?


  2. On Windows: physical RAM: 8 GB; testing data: 9 GB. What I observed was that once my job started, the memory usage kept increasing until it hit the limit. The job does not finish either. I also tested functions in the biganalytics package (which also uses bigmemory) and found that memory blows up there too.

[Screenshot: Windows Task Manager memory usage]


2 Answers


"Cannot finish" is ambiguous here. If you wait long enough, your computation may well finish. When you rely on virtual memory, pages have to be swapped in and out of disk, which is thousands to millions of times slower than keeping everything in RAM. How much slowdown you see depends on how your algorithm accesses memory. If your algorithm touches each page only once, in a fixed order, it may not take too long. If your algorithm jumps around the data structure O(n^2) times, the paging will slow you down so much that it may never finish.
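To make the access-pattern point concrete, here is a toy sketch using the plain bigmemory API (not code from biglasso itself; the descriptor file name is hypothetical):

    library(bigmemory)
    X <- attach.big.matrix("X.desc")   # hypothetical descriptor file

    # Sequential, column-by-column access: each memory-mapped page is
    # touched roughly once, so the cost is close to one pass over the disk.
    col_sums <- vapply(seq_len(ncol(X)), function(j) sum(X[, j]), numeric(1))

    # Random access: almost every lookup can fault in a different page, so
    # the OS keeps evicting and re-reading pages and the run time explodes.
    vals <- numeric(1e5)
    for (k in seq_len(1e5)) {
      vals[k] <- X[sample(nrow(X), 1), sample(ncol(X), 1)]
    }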

answered 2016-03-11 at 23:24

On Windows, it can be useful to check Task Manager -> Performance -> Resource Monitor -> Disk activity to see how much data your process ID is writing to disk. It gives you an idea of how much data is going from RAM to virtual memory, whether the write speed is becoming a bottleneck, and so on.

answered 2016-03-11 at 23:46