
This week I have been trying to track down an unusual memory behaviour: when I run my code with a single thread it has a certain memory footprint, but if I run it with multiple threads the footprint grows considerably, for no apparent reason.

I know that strange things can happen because of the non-determinism of multithreaded programs, so it is better to debug them in some deterministic way. However, even in this completely deterministic case I cannot see why the memory footprint should increase.

This is the smallest program I was able to write that reproduces the problem:

test.cpp

#include <chrono>
#include <cstdint>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <openssl/sha.h>
#include <sparsehash/sparse_hash_map>
#include <thread>

struct mystruct
{
    uint64_t values[16];
    inline bool operator==(const mystruct &other) const {
        return !memcmp(values, other.values, sizeof(values));
    }
    inline bool operator!=(const mystruct &other) const { 
        return !(*this == other); 
    }
};

namespace std {
    template<>
    class hash<mystruct> {
    public:
        inline size_t operator()(const mystruct &s) const 
        {
            return s.values[0];
        }
    };
}

google::sparse_hash_map<mystruct, uint64_t, std::hash<mystruct>> h;

const uint64_t N = 1ull * 1024ull * 1024ull * 1024ull; // total number of keys to insert (~1 billion)
const int step = 128;                                  // distance between consecutive keys
volatile uint64_t next = 0;                            // next key that is allowed to be inserted
uint64_t total_keys = 0;                               // number of keys inserted so far
const uint64_t THREADS = 1;                            // number of inserter threads

void insert_data(uint64_t data) {
    unsigned char buf[256];
    mystruct b;
    for (uint64_t j = 0; j < N / THREADS; j++) {
        SHA256((unsigned char*)&data, 8, buf);
        memcpy(&b, buf, sizeof(mystruct));

        // spin until it is this thread's turn to insert `data`
        while (next != data)
            ;

        h[b] = 1;
        total_keys++;
        next += step;
        data += THREADS * step;
    }
}

int main() {
    std::thread t[THREADS];

    for (uint64_t i = 0; i < THREADS; i++) {
        t[i] = std::thread(insert_data, i * step);
    }   

    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    while (total_keys < N) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));

        std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
        uint64_t elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
        double speed = 1000.0*(double)total_keys / (double)elapsed_ms ;

        std::cout << std::fixed << "Processed " << total_keys << " items in " << elapsed_ms << " ms (" << std::setprecision(0) << speed << " keys/s)"  << ", next key is " << next << std::endl;
    }

    for (uint64_t i = 0; i < THREADS; i++) {
        t[i].join();
    }
}

The idea of the program is very simple:

  • insert a known, pre-determined sequence of 1 billion keys into the hash table with a loop
  • watch the program print to the console, every 100 ms, how many keys it has inserted so far and which key will be inserted next
  • wait for the OS to kill the program

The goal, therefore, is to verify that for a pre-defined sequence of insertions the memory consumption is the same regardless of how many threads push data into the hash table, i.e. that roughly the same number of keys gets inserted before the process is killed (I don't mind a difference of a few keys, e.g. due to when exactly the OS decides to kill the application). Yet the difference turns out to be huge.

The insert_data function uses a simple spin loop to guarantee the insertion order, i.e. the numbers 0, 128, 256 and so on... (using locks/mutexes makes no difference).
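
For reference, a minimal sketch of such a lock-based variant (this is not the original code: the names order_mutex, order_cv and insert_data_locked are mine, and it reuses the globals h, N, THREADS, step, next and total_keys from test.cpp). It enforces the same strict insertion order with a mutex and a condition variable instead of busy-waiting:

#include <condition_variable>
#include <mutex>

std::mutex order_mutex;
std::condition_variable order_cv;

void insert_data_locked(uint64_t data) {
    unsigned char buf[256];
    mystruct b;
    for (uint64_t j = 0; j < N / THREADS; j++) {
        SHA256((unsigned char*)&data, 8, buf);
        memcpy(&b, buf, sizeof(mystruct));

        std::unique_lock<std::mutex> lock(order_mutex);
        order_cv.wait(lock, [&] { return next == data; }); // wait for our turn

        h[b] = 1;
        total_keys++;
        next += step;
        order_cv.notify_all(); // wake whichever thread owns the next key

        data += THREADS * step;
    } // the lock is released at the end of each iteration
}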

I ran the program in the following environment:

  • a Debian 8.3 Linux VM with 4 GB of RAM
  • swap disabled with swapoff -a
  • GCC 4.9.2 with the -O3 flag
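
The memory footprint was observed from outside the program (the code above does not print it). For anyone reproducing this, here is a minimal Linux-only sketch of a helper (current_rss is my name, not part of the original program) whose result could be printed next to total_keys in the main loop:

#include <fstream>
#include <string>

// Return the "VmRSS: ... kB" line from /proc/self/status (Linux only),
// or an empty string if it cannot be found.
std::string current_rss() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.compare(0, 6, "VmRSS:") == 0)
            return line;
    }
    return std::string();
}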

These are the results of running the code above with different values of the THREADS variable:

THREADS = 1

Processed 54241 items in 100 ms (542410 keys/s), next key is 6942848
Processed 104857 items in 200 ms (524285 keys/s), next key is 13421696
...
Processed 13410458 items in 35698 ms (375664 keys/s), next key is 1716538624
Processed 13421773 items in 35799 ms (374920 keys/s), next key is 1717986944
...
... 6 seconds of data because hashtable is being resized...
...
Processed 13421773 items in 41245 ms (325416 keys/s), next key is 1717986944
Processed 13438143 items in 41345 ms (325025 keys/s), next key is 1720082432
...
Processed 23580346 items in 65567 ms (359637 keys/s), next key is 3018284416
Killed

THREADS = 2

Processed 69922 items in 101 ms (692297 keys/s), next key is 8950016
Processed 121766 items in 202 ms (602802 keys/s), next key is 15586048
...
Processed 13409271 items in 37098 ms (361455 keys/s), next key is 1716386688
Processed 13421773 items in 37198 ms (360820 keys/s), next key is 1717986944
...
... 6 seconds of data because hashtable is being resized...
...
Processed 13421773 items in 43204 ms (310660 keys/s), next key is 1717986944
Processed 13443021 items in 43304 ms (310434 keys/s), next key is 1720706688
...
Processed 20024294 items in 64129 ms (312250 keys/s), next key is 2563110656
Killed

THREADS = 4

Processed 31 items in 100 ms (310 keys/s), next key is 3968
Processed 33882 items in 200 ms (169410 keys/s), next key is 4336896
...
Processed 13399262 items in 48949 ms (273739 keys/s), next key is 1715105536
Processed 13421773 items in 49049 ms (273640 keys/s), next key is 1717986944
...
... 5 seconds of data because hashtable is being resized...
...
Processed 13421773 items in 54091 ms (248133 keys/s), next key is 1717986944
Processed 13421773 items in 54201 ms (247630 keys/s), next key is 1717986944
Killed

As you can see, all three runs identify the same key, 1717986944, as the one that triggers the hash table resize, and in all of them this happens at the same inserted key number, 13421773. This confirms that, regardless of the number of running threads, the insertions always happen in strictly the same order.
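
As a sanity check on those numbers: next advances by step = 128 per insertion, and 13421773 * 128 = 1717986944, which is exactly the reported key. Moreover, 13421773 ≈ 0.8 * 2^24, which is what one would expect if sparse_hash_map grows when a table of 2^24 buckets reaches its default maximum occupancy of roughly 80% (an assumption about the library's defaults, not something I verified).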

However, THREADS=1 manages to insert 3556052 more keys than THREADS=2 (corresponding to 434 MB), which in turn inserts 6602521 more keys than THREADS=4 (corresponding to 805 MB).
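
Those figures are consistent if each extra entry is costed at sizeof(mystruct) = 128 bytes of key data: 3556052 * 128 bytes ≈ 434 MB and 6602521 * 128 bytes ≈ 805 MB.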

Can anybody explain why I get such strange memory consumption even under such deterministic conditions? Am I really missing something?


1 Answer


Posting this as an answer because I was able to understand what is going on.

After a lot of time and debugging I found that the root cause of the memory consumption is a pathological allocation/deallocation pattern that produces heap fragmentation. There is no memory leak or anything like that. I was also able to reproduce the problem with a custom hash table implementation.

Strange as it may seem, in both cases adding a malloc_trim(0); at the end of the main loop, right after the line

std::cout << std::fixed << "Processed " << total_keys << " items in " << elapsed_ms << " ms (" << std::setprecision(0) << speed << " keys/s)"  << ", next key is " << next << std::endl;

compensates somewhat and allows the program to reclaim more memory.
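
Concretely, the change amounts to something like this (a sketch of the fix described above; malloc_trim is glibc-specific and declared in <malloc.h>):

#include <malloc.h> // malloc_trim (glibc-specific)

// at the end of the status loop in main(), right after the std::cout line:
malloc_trim(0); // ask glibc to release free heap memory back to the OS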

answered 2017-03-01T17:07:27.247