I am developing an R package called biglasso that fits lasso models for massive data sets by using the memory-mapping techniques implemented in the bigmemory C++ library. Specifically, for a very large dataset (say 10 GB), a file-backed big.matrix is first created, with its data stored in memory-mapped files on disk. The model-fitting algorithm then accesses the big.matrix via the MatrixAccessor defined in the C++ library to obtain data for computation. I assume that the memory-mapping technique allows one to work on data larger than the available RAM, as mentioned in the bigmemory paper.
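To make the setup concrete, here is a minimal sketch of the R side of that workflow; the dimensions and file names are illustrative only, not the ones from my actual tests.

    library(bigmemory)

    # Create a file-backed big.matrix whose data live in a memory-mapped
    # file on disk (dimensions and file names are illustrative only).
    X <- filebacked.big.matrix(nrow = 1e6, ncol = 1000, type = "double",
                               backingfile = "X.bin",
                               descriptorfile = "X.desc",
                               backingpath = ".")

    # The matrix can later be re-attached from its descriptor file without
    # reading the whole backing file into RAM.
    X <- attach.big.matrix("X.desc")

    # The C++ fitting code receives X@address and wraps it in a
    # MatrixAccessor to read columns during the computation.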
For my package, everything works great at this point as long as the data size doesn't exceed the available RAM. However, when the data is larger than RAM, the code runs seemingly forever: no complaints, no errors, no stopping. On Mac, I checked the top command and noticed that the status of the job kept switching between "sleeping" and "running"; I am not sure what this means or whether it indicates that something is going wrong.
[EDIT:]
By "cannot finish", "run forever", I mean that: working on 18 GB of data with 16 GB RAM cannot finish for over 1.5 hours, but it could be done within 5 minutes if with 32 GB RAM.
[END EDIT]
Questions:
(1) I basically understand that memory-mapping uses virtual memory so that it can handle data larger than RAM. But how much memory does it need to deal with larger-than-RAM objects? Is there an upper bound? Or is it determined by the size of the virtual memory? Since virtual memory is effectively unlimited (constrained only by hard drive space), does that mean the memory-mapping approach can handle data much, much larger than physical RAM?
(2) Is there a way to measure the memory used in physical RAM and the virtual memory used, separately? (A rough sketch of what I have in mind follows these questions.)
(3) Is there anything I am doing wrong? What are possible reasons for my problem here?
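For question (2), the best I have come up with so far is polling the operating system for the resident set size versus the virtual size of the R process. This is only a rough sketch, assuming a Unix-like system where ps reports rss and vsz in kilobytes; I am not sure it is the right way, hence the question.

    # Rough attempt at separating physical (resident) from virtual memory
    # for the current R process. Assumes a Unix-like system where
    # "ps -o rss,vsz" reports sizes in kilobytes.
    mem_usage <- function() {
      out <- system2("ps",
                     c("-o", "rss=,vsz=", "-p", as.character(Sys.getpid())),
                     stdout = TRUE)
      vals <- as.numeric(strsplit(trimws(out), "\\s+")[[1]])
      c(resident_GB = vals[1] / 1024^2, virtual_GB = vals[2] / 1024^2)
    }

    mem_usage()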
Really appreciate any feedback! Thanks in advance.
Below are some details of my experiments on Mac and Windows and related questions.
- On Mac OS: physical RAM: 16 GB; testing data: 18 GB. Here is a screenshot of the memory usage. The code cannot finish.
[EDIT 2]
I attached the CPU usage and history here. Only a single core is used for the R computation. It's strange that System uses 6% CPU while User uses only 3%. And in the CPU history window, there are a lot of red areas.
Question: What does this suggest? I now suspect that the CPU cache is filled up. Is that right? If so, how could I resolve this issue?
[END EDIT 2]
Questions:
(4) As I understand it, the "memory" column shows the memory used in physical RAM, while the "real memory" column shows the total memory usage, as pointed out here. Is that correct? The memory used always shows ~2 GB, so I don't understand why so much of the RAM is not being used.
(5) A minor question: from what I observed, it seems that "memory used" + "Cache" must always be less than "Physical memory" (in the bottom middle part). Is this correct?
- On a Windows machine: physical RAM: 8 GB; testing data: 9 GB. What I observed was that as my job started, the memory usage kept increasing until it hit the limit. The job cannot finish either. I also tested functions in the biganalytics package (which also uses bigmemory) and found that the memory blows up too.
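For reference, the Windows test was of roughly this shape; the dimensions below are made up so that the backing file exceeds 8 GB of RAM, and colsum simply stands in for the biganalytics functions I tried.

    library(bigmemory)
    library(biganalytics)

    # Illustrative setup: a file-backed matrix whose backing file
    # (about 3e7 * 40 * 8 bytes, i.e. roughly 9-10 GB) exceeds physical RAM.
    X <- filebacked.big.matrix(nrow = 3e7, ncol = 40, type = "double",
                               backingfile = "bigX.bin",
                               descriptorfile = "bigX.desc",
                               backingpath = ".")

    # A simple column-wise pass over the data; on my 8 GB Windows machine
    # this kind of call also drives memory usage up to the limit.
    cs <- colsum(X)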