linux - NUMA-aware cache-aligned memory allocation

Question

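On Linux, the C library provides a function for cache-aligned allocation (posix_memalign), which helps prevent false sharing. To allocate from a specific NUMA node of the architecture, we can use the libnuma library. What I want is something that does both. I bind certain threads to certain processors, and I want to allocate the local data structures of each thread from the corresponding NUMA node, in order to reduce the latency of the threads' memory operations. How can I do this?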

score 12 · Accepted Answer

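The numa_alloc_*() functions in libnuma allocate whole pages of memory, typically 4096 bytes. Cache lines are typically 64 bytes. Since 4096 is a multiple of 64, anything that comes back from numa_alloc_*() is already aligned at the cache-line level.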
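To make the alignment argument concrete, here is a minimal sketch that checks it. It does not call libnuma: the 4096-byte page size and 64-byte cache line are the typical values mentioned above (assumptions, not values queried from the machine), and page_aligned_alloc() is a hypothetical stand-in that uses C11 aligned_alloc() to mimic a page-granular allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_BYTES      4096  /* assumed typical page size */
#define CACHE_LINE_SIZE 64    /* assumed typical cache line size */

/* A page boundary is also a cache line boundary only if this holds. */
_Static_assert(PAGE_BYTES % CACHE_LINE_SIZE == 0,
               "page size must be a multiple of the cache line size");

/* Returns nonzero if p is aligned to align (a power of two). */
int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical stand-in for a page-granular allocator such as the
   numa_alloc_*() family: rounds the request up to whole pages and
   returns page-aligned memory. Release it with free(). */
void *page_aligned_alloc(size_t bytes)
{
    size_t pages = (bytes + PAGE_BYTES - 1) / PAGE_BYTES;
    return aligned_alloc(PAGE_BYTES, pages * PAGE_BYTES);
}
```

Any pointer for which is_aligned(p, PAGE_BYTES) holds also satisfies is_aligned(p, CACHE_LINE_SIZE), which is all the paragraph above relies on.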

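Beware of the numa_alloc_*() functions, however. The man page says they are slower than a corresponding malloc(), which I'm sure is true, but the much bigger problem I've found is that simultaneous allocations from numa_alloc_*() running on many cores at once suffer massive contention problems. In my case, replacing malloc() with numa_alloc_onnode() was a wash (everything I gained by using local memory was offset by increased allocation/free time); tcmalloc was faster than either. I was performing thousands of 12-16 KB mallocs on 32 threads/cores at once. Timing experiments showed that the single-threaded speed of numa_alloc_onnode() was not what made my process spend so much time in allocation, which leaves lock/contention issues as the likely cause.

The solution I've adopted is to numa_alloc_onnode() large chunks of memory once, and then hand that memory out to the threads running on each node as needed. I use gcc atomic builtins to let each thread (I pin threads to CPUs) grab from the memory allocated on its node. You can align the pieces to the cache line size as they are handed out, if you want: I do. This approach beats the pants off even tcmalloc (which is thread-aware but not NUMA-aware; at least the Debian Squeeze version doesn't seem to be).

The downside of this approach is that you can't free the individual pieces (well, not without a lot more work, anyway); you can only free the whole underlying on-node allocation. However, if this is temporary on-node scratch space for a function call, or you can otherwise specify exactly when the memory is no longer needed, then this approach works very well. It obviously helps if you can also predict how much memory you need to allocate on each node.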

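@nandu: I won't post the full source; it's long, and in places it is tied to other things I do, which makes it less than transparent. What I will post is a slightly shortened version of my new malloc() function to illustrate the core idea: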

void *my_malloc(struct node_memory *nm,int node,long size)
{
  long off,obytes;

  // round up size to the nearest cache line size
  // (optional, though some rounding is essential to avoid misalignment problems)

  if ((obytes = (size % CACHE_LINE_SIZE)) > 0)
    size += CACHE_LINE_SIZE - obytes;

  // atomically increase the offset for the requested node by size

  if (((off = __sync_fetch_and_add(&(nm->off[node]),size)) + size) > nm->bytes) {
    fprintf(stderr,"Out of allocated memory on node %d\n",node);
    return(NULL);
  }
  else
    return((void *) (nm->ptr[node] + off));

}

where struct node_memory is

struct node_memory {
  long bytes;         // the number of bytes of memory allocated on each node
  char **ptr;         // ptr array of ptrs to the base of the memory on each node
  long *off;          // array of offsets from those bases (in bytes)
  int nptrs;          // the size of the ptr[] and off[] arrays
};

nm->ptr[node] is set up using the libnuma function numa_alloc_onnode().
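The answer does not show the setup code, so the following is only a sketch of how a struct node_memory like the one above might be filled in and torn down. The names node_memory_init(), node_memory_destroy(), alloc_on_node() and free_on_node() are hypothetical. The two stand-ins use aligned_alloc()/free() so that the sketch builds without libnuma; on a NUMA machine they would presumably wrap numa_alloc_onnode(size, node) and numa_free(start, size) instead.

```c
#include <assert.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64  /* assumed typical cache line size */

struct node_memory {
    long bytes;   /* bytes of memory allocated on each node */
    char **ptr;   /* base pointer of the memory on each node */
    long *off;    /* current offset from each base, in bytes */
    int nptrs;    /* size of the ptr[] and off[] arrays */
};

/* Hypothetical stand-ins. With libnuma these would be
   numa_alloc_onnode((size_t)bytes, node) and numa_free(p, (size_t)bytes). */
void *alloc_on_node(long bytes, int node)
{
    (void)node;  /* the stand-in ignores the node */
    return aligned_alloc(CACHE_LINE_SIZE, (size_t)bytes);
}

void free_on_node(void *p, long bytes)
{
    (void)bytes;
    free(p);
}

/* Releases every per-node block and the bookkeeping arrays. */
void node_memory_destroy(struct node_memory *nm)
{
    assert(nm != NULL);
    if (nm->ptr != NULL)
        for (int n = 0; n < nm->nptrs; n++)
            if (nm->ptr[n] != NULL)
                free_on_node(nm->ptr[n], nm->bytes);
    free(nm->ptr);
    free(nm->off);
    nm->ptr = NULL;
    nm->off = NULL;
    nm->nptrs = 0;
}

/* Allocates one block per node, rounded up to whole cache lines.
   Returns 0 on success, -1 on failure. */
int node_memory_init(struct node_memory *nm, int nnodes, long bytes_per_node)
{
    if (nm == NULL || nnodes <= 0 || bytes_per_node <= 0)
        return -1;
    long rem = bytes_per_node % CACHE_LINE_SIZE;
    if (rem > 0)
        bytes_per_node += CACHE_LINE_SIZE - rem;
    nm->bytes = bytes_per_node;
    nm->nptrs = nnodes;
    nm->ptr = calloc((size_t)nnodes, sizeof *nm->ptr);  /* NULL-filled */
    nm->off = calloc((size_t)nnodes, sizeof *nm->off);  /* zero offsets */
    if (nm->ptr == NULL || nm->off == NULL) {
        node_memory_destroy(nm);
        return -1;
    }
    for (int n = 0; n < nnodes; n++) {
        nm->ptr[n] = alloc_on_node(bytes_per_node, n);
        if (nm->ptr[n] == NULL) {
            node_memory_destroy(nm);
            return -1;
        }
    }
    return 0;
}
```

Freeing happens only at this whole-node granularity, which matches the limitation described in the answer.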

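I usually store the allowable node information in the structure too, so that my_malloc() can check that node requests are sensible without making function calls; I also check that nm exists and that size is sensible. The function __sync_fetch_and_add() is a gcc builtin atomic function; if you are not compiling with gcc you'll need something else. I use atomics because, in my limited experience, they are much faster than mutexes under high thread/core count conditions (as on 4P NUMA machines).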
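As one possible "something else", here is a sketch of the same bump-allocation idea written against C11 <stdatomic.h> rather than the gcc builtin, with the argument checks mentioned above folded in. This is not the answerer's code: struct node_arena, arena_init() and arena_alloc() are hypothetical names, the sketch handles a single node (you would keep one arena per node), and the 64-byte cache line is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64  /* assumed typical cache line size */

/* Hypothetical per-node arena: one block of memory plus an atomic offset. */
struct node_arena {
    char *base;       /* start of the block (cache-line aligned) */
    long bytes;       /* size of the block in bytes */
    atomic_long off;  /* next free offset, advanced atomically */
};

void arena_init(struct node_arena *a, char *base, long bytes)
{
    assert(a != NULL && base != NULL && bytes > 0);
    a->base = base;
    a->bytes = bytes;
    atomic_init(&a->off, 0);
}

/* Returns a cache-line-aligned piece of the arena, or NULL if the
   arguments are not sensible or the arena is exhausted. */
void *arena_alloc(struct node_arena *a, long size)
{
    if (a == NULL || size <= 0 || size > a->bytes)
        return NULL;

    /* round the request up to a whole number of cache lines */
    long rem = size % CACHE_LINE_SIZE;
    if (rem > 0)
        size += CACHE_LINE_SIZE - rem;

    /* atomically claim [off, off + size); the offset still advances
       when the request does not fit, as in the gcc-builtin version */
    long off = atomic_fetch_add(&a->off, size);
    if (off + size > a->bytes)
        return NULL;
    return a->base + off;
}
```

Because every claimed size is a multiple of the cache line size, each returned pointer stays cache-line aligned as long as base is.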

score 8 · Accepted Answer

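If you just want alignment functionality on top of a NUMA allocator, you can easily build your own.

The idea is to call the unaligned malloc() with a little more space, then return the first aligned address inside it. To be able to free the block later, you need to store the base address at a known location.

Here is an example. Just replace the following names with whatever is appropriate: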

pint         //  An unsigned integer that is large enough to store a pointer.
NUMA_malloc  //  The NUMA malloc function
NUMA_free    //  The NUMA free function

void* my_NUMA_malloc(size_t bytes,size_t align, /* NUMA parameters */ ){

    //  The NUMA malloc function
    void *ptr = NUMA_malloc(
        (size_t)(bytes + align + sizeof(pint)),
        /* NUMA parameters */
    );

    if (ptr == NULL)
        return NULL;

    //  Get aligned return address (align must be a power of two)
    pint *ret = (pint*)((((pint)ptr + sizeof(pint)) & ~(pint)(align - 1)) + align);

    //  Save the free pointer
    ret[-1] = (pint)ptr;

    return ret;
}

void my_NUMA_free(void *ptr){
    if (ptr == NULL)
        return;

    //  Get the free pointer
    ptr = (void*)(((pint*)ptr)[-1]);

    //  The NUMA free function
    NUMA_free(ptr);
}

When you use this, you need to call my_NUMA_free on anything allocated with my_NUMA_malloc.
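For a version that compiles as written, here is a sketch of the same over-allocate-and-align technique with the placeholders filled in. It makes a few assumptions that are not in the answer: raw_alloc() and raw_free() are hypothetical stand-ins built on malloc()/free(), shaped like libnuma's numa_alloc_onnode(size, node) and numa_free(start, size); because numa_free() needs the allocation size, the hidden header stores the total size next to the base pointer; and the aligned address is computed by rounding up rather than with the round-down-plus-align expression above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the underlying NUMA allocator. */
void *raw_alloc(size_t size) { return malloc(size); }
void raw_free(void *start, size_t size) { (void)size; free(start); }

/* Stored immediately before the aligned address handed to the caller. */
struct align_header {
    void *base;    /* pointer originally returned by raw_alloc() */
    size_t total;  /* total size originally requested from raw_alloc() */
};

void *aligned_node_alloc(size_t bytes, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);  /* power of two */
    if (align < _Alignof(max_align_t))
        align = _Alignof(max_align_t);  /* keeps the header itself aligned */

    size_t total = bytes + align + sizeof(struct align_header);
    void *base = raw_alloc(total);
    if (base == NULL)
        return NULL;

    /* first aligned address that leaves room for the header before it */
    uintptr_t addr = ((uintptr_t)base + sizeof(struct align_header) + align - 1)
                     & ~(uintptr_t)(align - 1);

    struct align_header *h = (struct align_header *)addr - 1;
    h->base = base;
    h->total = total;
    return (void *)addr;
}

void aligned_node_free(void *ptr)
{
    if (ptr == NULL)
        return;
    struct align_header *h = (struct align_header *)ptr - 1;
    raw_free(h->base, h->total);
}
```

As with the original, pointers from aligned_node_alloc() must go back through aligned_node_free(), never directly to the underlying free function.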
