c - C: memcpy speed on dynamically allocated arrays

Question

I need help with the performance of the following code. It does a memcpy on two dynamically allocated arrays of arbitrary size:

int main()
{
  double *a, *b;
  unsigned n = 10000000, i;
  a = malloc(n*sizeof(double));
  b = malloc(n*sizeof(double));
  for(i=0; i<n; i++) {
    a[i] = 1.0;
    /* b[i] = 0.0; */
  }

  tic();
  bzero(b, n*sizeof(double));
  toc("bzero1");

  tic();
  bzero(b, n*sizeof(double));
  toc("bzero2");

  tic();
  memcpy(b, a, n*sizeof(double));
  toc("memcpy");
}

tic/toc measure the execution time.

On my computer it takes 0.035s to memcpy (Linux, gcc version 4.4.6). If I now uncomment the line which initializes the destination array b, the code is three times faster (!) - 0.011s.

I have observed similar behavior when using a loop instead of memcpy. Usually I do not care about this since it is enough to 'initialize' the memory before using it. However, I now need to perform a simple memory copy, and do it as fast as possible. Initializing the data requires writing e.g. 0 to the memory, which is not necessary and takes time. And I would like to perform a memory copy with all available memory bandwidth.

Is there a solution to this problem? Or is it connected to the way Linux handles dynamic memory (some sort of lazy page allocation?) and can not be worked around? How is it on other systems?

Edit: The same results are obtained with gcc 4.6. I used -O3 to compile.

Edit: Thank you all for your comments. I do understand that memory mapping takes time. I guess I just have a hard time accepting that it takes so long, much longer than the actual memory access. The code has been modified to include a benchmark of the initialization of array b using two subsequent bzero calls. The timings now show

bzero1 0.273981
bzero2 0.056803
memcpy 0.117934

Clearly, the first bzero call does much more than just stream zeros to memory - that is memory mapping and memory zeroing. The second bzero call on the other hand takes half of the time required to do a memcpy, which is exactly as expected - write only time vs. read and write time. I understand that the overhead of the second bzero call must be there due to OS security reasons. What about the rest? Can I not decrease it somehow, e.g. use larger memory pages? Different kernel settings?

I should mention that I run this on Ubuntu wheeze.

score 10 · Accepted Answer

这可能是惰性页面分配，Linux 仅在首次访问时映射页面。IIRC Linux 中新块中的每一页都是空白页的写时复制，并且您的分配足够大以需要新块。

如果你想解决它，你可以以 4k 的间隔写入一个字节或一个字。这可能会使映射到 RAM 的虚拟地址比写入整个每个页面的速度稍快一些。

不过，我不希望（强制延迟内存映射发生的最有效的解决方法）加上（复制）比没有初始化的（复制）要快得多b。因此，除非有特定原因您只关心副本的性能，而不是整个操作，否则它是徒劳的。它是“现在付款或以后付款”，Linux 以后付款，您只是在衡量以后的时间。

score 6 · Accepted Answer

当然，如果您将初始化和复制的速度与仅复制的速度进行比较，那么初始化应该包含在定时部分中。在我看来，您实际上应该对此进行比较：

// Version 1
for(i=0; i<n; i++)
    a[i] = 1.0;

tic();

memcpy(b, a, n*sizeof(double));

toc();

对此：

// Version 2
for(i=0; i<n; i++)
    a[i] = 1.0;

tic();

for(i=0; i<n; i++)
    b[i] = 0.0;
memcpy(b, a, n*sizeof(double));

toc();

我预计这将使您的 3 倍速度提升急剧下降。

编辑：正如史蒂夫·杰索普（Steve Jessop）所建议的，您可能还想测试第三种策略，即每页只触摸一个条目：

// Version 3
for(i=0; i<n; i++)
    a[i] = 1.0;

tic();

for(i=0; i<n; i+=DOUBLES_PER_PAGE)
    b[i] = 0.0;
memcpy(b, a, n*sizeof(double));

toc();

score 5 · Accepted Answer

由于 (1) 惰性页面分配和 (2) 内核的惰性页面零初始化，第一个 bzero 运行时间更长。虽然第二个原因由于安全原因是不可避免的，但可以通过使用更大（“巨大”）页面来优化惰性页面分配。

在 Linux 上至少有两种使用大页面的方法。困难的方式是hugetlbfs。简单的方法是透明大页面。

在系统上的进程列表中搜索khugepaged。如果存在这样的过程，则支持透明大页面，如果您更改为以下内容，则可以在应用程序中使用它们malloc：

posix_memalign((void **)&b, 2*1024*1024, n*sizeof(double));
madvise((void *)b, n*sizeof(double), MADV_HUGEPAGE);

c - C: memcpy speed on dynamically allocated arrays

3 回答 3

Related

Reference