Yes, there are standard techniques for caching and memory rebalancing. The simplest approach would follow what you're thinking of doing: create a cache 'factory' or 'manager'. It would allocate cache objects on demand, each object having a size limit (think of a CPU cache line, which typically has a fixed size of 64 bytes). Knowing only the number of cache objects allocated, the manager could roughly estimate the amount of memory in use and compare it against a total_max_limit, which it would know from the machine it runs on, the type of OS, and so on. When the total_max_limit is hit and some cache objects need to be freed, the most commonly used eviction policy is LRU: choose the least recently used cache object to destroy. To implement this, you would store pointers to the cache objects inside the manager in a deque; whenever a cache object gets accessed, it tells the manager (through a back-pointer in the cache object's structure) to 'mark-as-accessed', i.e. to move that object's pointer to the front of the deque. The last pointer in the deque (the tail) then always references the least recently used cache object, so factory.rebalance() just pops the back and frees the object it held.
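Here's a minimal sketch of that manager in C++. All the names (`CacheManager`, `markAccessed`, `rebalance`) are illustrative, not a real library API, and I've swapped the deque for a `std::list` so the 'mark-as-accessed' move-to-front is an O(1) splice rather than a linear search; a side map gives O(1) lookup of an object's node.

    #include <cstddef>
    #include <list>
    #include <unordered_map>

    struct CacheObject {
        std::size_t size_bytes;   // fixed per-object size limit, e.g. 64
        // ... payload would go here ...
    };

    class CacheManager {
    public:
        explicit CacheManager(std::size_t total_max_limit)
            : limit_(total_max_limit) {}

        // Allocate a cache object on demand and register it as most
        // recently used (front of the list).
        CacheObject* allocate(std::size_t size_bytes) {
            lru_.push_front(CacheObject{size_bytes});
            CacheObject* obj = &lru_.front();   // list nodes have stable addresses
            index_[obj] = lru_.begin();
            used_ += size_bytes;
            rebalance();                        // evict if we went over the limit
            return obj;
        }

        // 'mark-as-accessed': splice the object's node to the front in O(1).
        void markAccessed(CacheObject* obj) {
            auto it = index_.find(obj);
            if (it != index_.end())
                lru_.splice(lru_.begin(), lru_, it->second);
        }

        // Pop the tail (least recently used) until we're back under the limit.
        // Keeps at least one object so allocate() never returns a dangling pointer.
        void rebalance() {
            while (used_ > limit_ && lru_.size() > 1) {
                CacheObject& victim = lru_.back();
                used_ -= victim.size_bytes;
                index_.erase(&victim);
                lru_.pop_back();                // destroys the LRU object
            }
        }

    private:
        std::size_t limit_;
        std::size_t used_ = 0;
        std::list<CacheObject> lru_;            // front = MRU, back = LRU
        std::unordered_map<CacheObject*,
                           std::list<CacheObject>::iterator> index_;
    };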
There are other algorithms, but LRU is the most commonly used one, and priority caching can be built on top of it. Create several such cache managers and distribute their total_max_limits so that the highest-priority one gets the most memory and lower-priority ones get progressively less. As a result, low-priority items will be evicted sooner, while more of the high-priority items stay resident in memory/cache. This is likely to perform better than evaluating some weight-based formula on every access to decide how far toward the head of the deque each particular cache object should be moved.
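To make that concrete, here's how the tiered setup could look, reusing the `CacheManager` sketch above. The budget split (1/2, 1/3, 1/6 of a hypothetical 1 GiB total) is just an example; the point is that each tier runs its own independent LRU against its own limit, so low-priority objects hit their smaller ceiling and get evicted first.

    // Builds on the CacheManager sketch above; tier sizes are made up.
    int main() {
        const std::size_t GiB = 1024ULL * 1024 * 1024;
        CacheManager high(GiB / 2);   // highest priority: biggest budget
        CacheManager mid(GiB / 3);
        CacheManager low(GiB / 6);    // lowest priority: evicted fastest

        // Callers allocate from the manager matching an item's priority;
        // eviction in one tier never touches the others.
        CacheObject* hot = high.allocate(64);
        high.markAccessed(hot);       // keeps 'hot' near the front of its tier
    }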