performance - Performance penalty for executing x86 instructions stored in the data segment?

Question

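I have a simple program that first writes some native x86 instructions into a declared buffer, then sets a function pointer to that buffer and calls it. However, I'm seeing a severe performance penalty when the buffer is allocated on the stack (as opposed to on the heap, or even in the global data area). I verified that the start of the instruction sequence in the data buffer is on a 16-byte boundary (I'm assuming that is what the CPU requires, or at least prefers). I don't know why it should matter where in the process I execute my instructions from, but in the program below, "GOOD" executes in 4 seconds on my dual-core workstation, while "BAD" takes about 6 minutes. Is there some kind of alignment/i-cache/prediction issue going on here? My VTune evaluation license just ended,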

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

typedef int (*funcPtrType)(int, int);

int foo(int a, int b) { return a + b; }

void main()
{
  // Instructions in buf are identical to what the compiler generated for "foo".
  char buf[201] = {0x55,
                   0x8b, 0xec,
                   0x8b, 0x45, 0x08,
                   0x03, 0x45, 0x0c,
                   0x5D,
                   0xc3
                  };

  int i;

  funcPtrType ptr;

#ifdef GOOD
  char* heapBuf = (char*)malloc(200);
  printf("Addr of heap buf: %x\n", &heapBuf[0]);
  memcpy(heapBuf, buf, 200);
  ptr = (funcPtrType)(&heapBuf[0]);
#else // BAD
  printf("Addr of local buf: %x\n", &buf[0]);
  ptr = (funcPtrType)(&buf[0]);
#endif

  for (i=0; i < 1000000000; i++)
    ptr(1,2);
}

The results of the runs are as follows:

$ cl -DGOOD ne3.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 11.00.7022 for 80x86
Copyright (C) Microsoft Corp 1984-1997. All rights reserved.

ne3.cpp
Microsoft (R) 32-Bit Incremental Linker Version 5.10.7303
Copyright (C) Microsoft Corp 1992-1997. All rights reserved.

/out:ne3.exe
ne3.obj
$ time ./ne3
Addr of heap buf: 410eb0

real 0m 4.33s
user 0m 4.31s
sys 0m 0.01s
$
$
$ cl ne3.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 11.00.7022 for 80x86
Copyright (C) Microsoft Corp 1984-1997. All rights reserved.

ne3.cpp
Microsoft (R) 32-Bit Incremental Linker Version 5.10.7303
Copyright (C) Microsoft Corp 1992-1997. All rights reserved.

/out:ne3.exe
ne3.obj
$ time ./ne3
Addr of local buf: 12feb0

real 6m41.19s
user 6m40.46s
sys 0m 0.03s
$

Thanks.

Shashank

score 3 · Accepted Answer

Stack protection for security?

As a wild guess, you could be running into an MMU-based stack protection scheme. A number of security exploits have relied on deliberate buffer overruns that inject executable code onto the stack, and one way to fight these is a non-executable stack. Executing from such a stack would trap into the OS, where I suppose it's possible that the OS or some antivirus software does something on each trap.

Negative i-cache coherency interaction?

Another possibility is that mixing code fetches and data accesses at nearby addresses is defeating the CPU's caching strategy. I believe x86 implements an essentially automatic code/data coherency model, which is likely to invalidate large amounts of nearby cached instructions on any memory write. You can't really fix this by making your program stop using the stack, because machine code writes to the stack all the time, for example whenever a parameter or return address is pushed for a procedure call. What you can do, obviously, is move the dynamic code somewhere else.

CPUs are really fast these days relative to DRAM or even the outer cache levels, so anything that defeats the inner cache levels will be quite serious. On top of that, the coherency mechanism probably involves some sort of micro-trap within the CPU, followed by a "loop" in hardware to invalidate things. It isn't something Intel or AMD would have optimized for speed, since for most programs it would never happen, and when it did it would normally happen only once, after loading a program.
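To check whether this theory is even plausible for a given build, one could test whether a frequently written stack address falls in the same cache-line-sized block as the code bytes. The sketch below is only a diagnostic aid: `shares_line` is a hypothetical helper, and the 64-byte line size is an assumption (the granularity at which a particular CPU reacts to writes near executing code may be different).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Common on x86, but not guaranteed. */
#define ASSUMED_LINE_SIZE 64u

/*
 * Hypothetical helper: returns 1 if the address `written` lies in any
 * line_size-sized block that is also touched by the byte range
 * [code, code + code_len), and 0 otherwise.
 */
static int shares_line(const void *code, size_t code_len,
                       const void *written, size_t line_size)
{
    assert(code_len > 0 && line_size > 0);

    /* Index of the first and last block covered by the code bytes. */
    uintptr_t first = (uintptr_t)code / line_size;
    uintptr_t last  = ((uintptr_t)code + code_len - 1) / line_size;

    /* Index of the block containing the written address. */
    uintptr_t w = (uintptr_t)written / line_size;

    return w >= first && w <= last;
}
```

In the program from the question, calling `shares_line(buf, 11, &i, ASSUMED_LINE_SIZE)` just before the loop would show whether the loop counter lands next to the code. The arguments and return address pushed by each `ptr(1,2)` call are also stack writes, so an address near the current stack top may be worth checking the same way.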

score 2 · Accepted Answer

My guess is that, since you have the variable i on the stack also, when you change i in your for loop, you trash the same cache line that the code is sitting in. Put the code in the middle of your buffer somewhere (and perhaps enlarge the buffer) so as to keep it separated from the other stack variables.
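A minimal sketch of that suggestion, assuming 64-byte cache lines: copy the code into the middle of an enlarged buffer, start it on a line boundary, and require at least one full line of padding on either side. `place_code_mid_buffer` is a hypothetical helper, not something from the question, and the line size is an assumption that may not match the granularity a given CPU actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 64-byte cache lines. Common on x86, but not guaranteed. */
#define CACHE_LINE 64u

/*
 * Hypothetical helper: copy `code` into roughly the middle of `buf`,
 * starting on a cache-line boundary, with at least one whole line of
 * unused padding before and after the lines the code occupies.
 * Returns the start of the copy, or NULL if `buf` is too small.
 */
static unsigned char *place_code_mid_buffer(unsigned char *buf, size_t buf_size,
                                            const unsigned char *code,
                                            size_t code_len)
{
    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t base  = (uintptr_t)buf;

    /* Round the midpoint up to the next line boundary. */
    uintptr_t start = (base + buf_size / 2 + CACHE_LINE - 1) & mask;

    /* End of the last line touched by the code, rounded up. */
    uintptr_t end   = (start + code_len + CACHE_LINE - 1) & mask;

    if (start - base < CACHE_LINE || end + CACHE_LINE > base + buf_size)
        return NULL;

    memcpy((void *)start, code, code_len);
    return (unsigned char *)start;
}
```

With something like a 1024-byte `buf` in `main`, the returned pointer would be cast to `funcPtrType` in place of `&buf[0]`. This only separates the code from neighbouring stack variables; it does not make the stack executable on systems that forbid that.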

Also note that executing instructions on the stack is usually the hallmark of a security exploit (such as a buffer overrun).

Therefore the OS is often configured to disallow this behaviour. Virus scanners may take action against it as well. Perhaps your program is running through a security check each time it tries to access that stack page (though I'd expect the sys time field to be larger in that case).

If you want to "officially" make a memory page executable, you should probably look into VirtualProtect().
