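I have done my profiling, and now I want to squeeze every possible bit of performance out of my hot spot.
I know about [MethodImplOptions.AggressiveInlining] and the ProfileOptimization class. Are there any others?
[Edit] I also just discovered [TargetedPatchingOptOut]. Never mind, apparently that one is not needed.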
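For readers who have not used the two mechanisms named above, here is a minimal sketch of how they are typically applied. The names HotPath, Square, SumOfSquares and the profile file name are made up for illustration, and the profile root is assumed to be an existing, writable directory. As far as I know, ProfileOptimization is ignored on single-core machines, and AggressiveInlining is only a hint to the JIT.

```csharp
using System;
using System.IO;
using System.Runtime;
using System.Runtime.CompilerServices;

static class HotPath
{
    // Hint only: asks the JIT to inline this method where it can.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int Square(int x)
    {
        return x * x;
    }

    // Hypothetical hot loop that calls the small helper above.
    public static long SumOfSquares(int n)
    {
        long sum = 0;
        for (int i = 0; i < n; ++i)
        {
            sum += Square(i);
        }
        return sum;
    }
}

static class Program
{
    static void Main()
    {
        // Assumption: the temp directory exists and is writable.
        ProfileOptimization.SetProfileRoot(Path.GetTempPath());
        // Hypothetical profile name; it records which methods get jitted
        // so a later run can compile them on a background thread.
        ProfileOptimization.StartProfile("Startup.profile");

        Console.WriteLine(HotPath.SumOfSquares(10));
    }
}
```

Whether the inlining hint is honored can only be confirmed by looking at the generated machine code.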
Yes, there are more tricks :-)
I've actually done quite a bit of research on optimizing C# code. So far, these are the most significant results:
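Calling Equals without IEquatable<T> is usually a bad plan, so if you use, for example, a hash, be sure to implement the right overloads and interfaces, because it will save you a ton of performance.
Foo[] is the fastest, but Foo[][] is usually faster than Foo[,].
There also used to be a guide called "Optimizing for the Intel Pentium processor" with a ton of tricks (such as shifting or multiplying instead of dividing). The compiler does a good job these days, but this sometimes helps a bit as well.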
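To make the first two points concrete, here is a minimal sketch. The Point struct and the SumJagged helper are hypothetical names, not taken from any real code base. The struct implements IEquatable<Point> so that hash-based collections can compare keys through the typed overload instead of boxing into Equals(object); the helper walks a jagged array row by row, so the inner loop runs over an ordinary one-dimensional array.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type, used only for illustration.
struct Point : IEquatable<Point>
{
    public readonly int X;
    public readonly int Y;

    public Point(int x, int y) { X = x; Y = y; }

    // Typed overload: lets HashSet<Point> compare keys without boxing.
    public bool Equals(Point other)
    {
        return X == other.X && Y == other.Y;
    }

    // Keep the object overload consistent with the typed one.
    public override bool Equals(object obj)
    {
        return obj is Point && Equals((Point)obj);
    }

    public override int GetHashCode()
    {
        unchecked { return (X * 397) ^ Y; }
    }
}

static class Demo
{
    // Sums a jagged array; each row is a plain int[].
    public static long SumJagged(int[][] rows)
    {
        long sum = 0;
        for (int r = 0; r < rows.Length; ++r)
        {
            int[] row = rows[r]; // fetch the row once per outer iteration
            for (int c = 0; c < row.Length; ++c)
            {
                sum += row[c];
            }
        }
        return sum;
    }

    static void Main()
    {
        var seen = new HashSet<Point>();
        seen.Add(new Point(1, 2));
        Console.WriteLine(seen.Contains(new Point(1, 2)));

        int[][] grid = { new[] { 1, 2, 3 }, new[] { 4, 5, 6 } };
        Console.WriteLine(SumJagged(grid));
    }
}
```

How much either change gains depends on the runtime version and the workload, so measure before and after.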
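Of course these are just optimizations; the biggest performance gains usually come from changing the algorithm and/or the data structure. Be sure to check which options are available to you, and don't restrict yourself too much to the .NET Framework. Also, I have a natural tendency to distrust the .NET implementation until I've checked the decompiled code myself. There is a lot of stuff in there that could have been implemented much faster (most of the time for good reasons).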
HTH
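Alex pointed out to me that, according to some people, Array.Copy is actually faster. Since I really don't know what has changed over the years, I decided the only proper course of action was to write a brand-new benchmark and put it to the test.
If you're only interested in the results, skip down. In most cases, a call to Buffer.BlockCopy clearly outperforms Array.Copy. Tested on .NET 4.5.2 on an Intel Skylake with 16 GB of memory (>10 GB free).
Code: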
using System;
using System.Diagnostics;

static class Program
{
    // Array.Copy between two separate arrays, repeated until about 1 GB has been copied.
    static void TestNonOverlapped1(int K)
    {
        long total = 1000000000;
        long iter = total / K;
        byte[] tmp = new byte[K];
        byte[] tmp2 = new byte[K];
        for (long i = 0; i < iter; ++i)
        {
            Array.Copy(tmp, tmp2, K);
        }
    }

    // Same as above, using Buffer.BlockCopy.
    static void TestNonOverlapped2(int K)
    {
        long total = 1000000000;
        long iter = total / K;
        byte[] tmp = new byte[K];
        byte[] tmp2 = new byte[K];
        for (long i = 0; i < iter; ++i)
        {
            Buffer.BlockCopy(tmp, 0, tmp2, 0, K);
        }
    }

    // Array.Copy within one array; source and destination overlap, shifted by 16 bytes.
    static void TestOverlapped1(int K)
    {
        long total = 1000000000;
        long iter = total / K;
        byte[] tmp = new byte[K + 16];
        for (long i = 0; i < iter; ++i)
        {
            Array.Copy(tmp, 0, tmp, 16, K);
        }
    }

    // Same overlapping copy, using Buffer.BlockCopy.
    static void TestOverlapped2(int K)
    {
        long total = 1000000000;
        long iter = total / K;
        byte[] tmp = new byte[K + 16];
        for (long i = 0; i < iter; ++i)
        {
            Buffer.BlockCopy(tmp, 0, tmp, 16, K);
        }
    }

    static void Main(string[] args)
    {
        // Block sizes from 16 bytes up to 8192 bytes, doubling each round.
        for (int i = 0; i < 10; ++i)
        {
            int N = 16 << i;
            Console.WriteLine("Block size: {0} bytes", N);
            Stopwatch sw = Stopwatch.StartNew();
            {
                sw.Restart();
                TestNonOverlapped1(N);
                Console.WriteLine("Non-overlapped Array.Copy: {0:0.00} ms", sw.Elapsed.TotalMilliseconds);
                // Collect between runs so one test's garbage does not skew the next.
                GC.Collect(GC.MaxGeneration);
                GC.WaitForFullGCComplete();
            }
            {
                sw.Restart();
                TestNonOverlapped2(N);
                Console.WriteLine("Non-overlapped Buffer.BlockCopy: {0:0.00} ms", sw.Elapsed.TotalMilliseconds);
                GC.Collect(GC.MaxGeneration);
                GC.WaitForFullGCComplete();
            }
            {
                sw.Restart();
                TestOverlapped1(N);
                Console.WriteLine("Overlapped Array.Copy: {0:0.00} ms", sw.Elapsed.TotalMilliseconds);
                GC.Collect(GC.MaxGeneration);
                GC.WaitForFullGCComplete();
            }
            {
                sw.Restart();
                TestOverlapped2(N);
                Console.WriteLine("Overlapped Buffer.BlockCopy: {0:0.00} ms", sw.Elapsed.TotalMilliseconds);
                GC.Collect(GC.MaxGeneration);
                GC.WaitForFullGCComplete();
            }
            Console.WriteLine("-------------------------");
        }
        Console.ReadLine();
    }
}
Results on the x86 JIT:
Block size: 16 bytes
Non-overlapped Array.Copy: 4267.52 ms
Non-overlapped Buffer.BlockCopy: 2887.05 ms
Overlapped Array.Copy: 3305.01 ms
Overlapped Buffer.BlockCopy: 2670.18 ms
-------------------------
Block size: 32 bytes
Non-overlapped Array.Copy: 1327.55 ms
Non-overlapped Buffer.BlockCopy: 763.89 ms
Overlapped Array.Copy: 2334.91 ms
Overlapped Buffer.BlockCopy: 2158.49 ms
-------------------------
Block size: 64 bytes
Non-overlapped Array.Copy: 705.76 ms
Non-overlapped Buffer.BlockCopy: 390.63 ms
Overlapped Array.Copy: 1303.00 ms
Overlapped Buffer.BlockCopy: 1103.89 ms
-------------------------
Block size: 128 bytes
Non-overlapped Array.Copy: 361.18 ms
Non-overlapped Buffer.BlockCopy: 219.77 ms
Overlapped Array.Copy: 620.21 ms
Overlapped Buffer.BlockCopy: 577.20 ms
-------------------------
Block size: 256 bytes
Non-overlapped Array.Copy: 192.92 ms
Non-overlapped Buffer.BlockCopy: 108.71 ms
Overlapped Array.Copy: 347.63 ms
Overlapped Buffer.BlockCopy: 353.40 ms
-------------------------
Block size: 512 bytes
Non-overlapped Array.Copy: 104.69 ms
Non-overlapped Buffer.BlockCopy: 65.65 ms
Overlapped Array.Copy: 211.77 ms
Overlapped Buffer.BlockCopy: 202.94 ms
-------------------------
Block size: 1024 bytes
Non-overlapped Array.Copy: 52.93 ms
Non-overlapped Buffer.BlockCopy: 38.84 ms
Overlapped Array.Copy: 144.39 ms
Overlapped Buffer.BlockCopy: 154.09 ms
-------------------------
Block size: 2048 bytes
Non-overlapped Array.Copy: 45.64 ms
Non-overlapped Buffer.BlockCopy: 30.11 ms
Overlapped Array.Copy: 118.33 ms
Overlapped Buffer.BlockCopy: 109.16 ms
-------------------------
Block size: 4096 bytes
Non-overlapped Array.Copy: 30.93 ms
Non-overlapped Buffer.BlockCopy: 30.72 ms
Overlapped Array.Copy: 119.73 ms
Overlapped Buffer.BlockCopy: 104.66 ms
-------------------------
Block size: 8192 bytes
Non-overlapped Array.Copy: 30.37 ms
Non-overlapped Buffer.BlockCopy: 26.63 ms
Overlapped Array.Copy: 90.46 ms
Overlapped Buffer.BlockCopy: 87.40 ms
-------------------------
Results on the x64 JIT:
Block size: 16 bytes
Non-overlapped Array.Copy: 1252.71 ms
Non-overlapped Buffer.BlockCopy: 694.34 ms
Overlapped Array.Copy: 701.27 ms
Overlapped Buffer.BlockCopy: 573.34 ms
-------------------------
Block size: 32 bytes
Non-overlapped Array.Copy: 995.47 ms
Non-overlapped Buffer.BlockCopy: 654.70 ms
Overlapped Array.Copy: 398.48 ms
Overlapped Buffer.BlockCopy: 336.86 ms
-------------------------
Block size: 64 bytes
Non-overlapped Array.Copy: 498.86 ms
Non-overlapped Buffer.BlockCopy: 329.15 ms
Overlapped Array.Copy: 218.43 ms
Overlapped Buffer.BlockCopy: 179.95 ms
-------------------------
Block size: 128 bytes
Non-overlapped Array.Copy: 263.00 ms
Non-overlapped Buffer.BlockCopy: 196.71 ms
Overlapped Array.Copy: 137.21 ms
Overlapped Buffer.BlockCopy: 107.02 ms
-------------------------
Block size: 256 bytes
Non-overlapped Array.Copy: 144.31 ms
Non-overlapped Buffer.BlockCopy: 101.23 ms
Overlapped Array.Copy: 85.49 ms
Overlapped Buffer.BlockCopy: 69.30 ms
-------------------------
Block size: 512 bytes
Non-overlapped Array.Copy: 76.76 ms
Non-overlapped Buffer.BlockCopy: 55.31 ms
Overlapped Array.Copy: 61.99 ms
Overlapped Buffer.BlockCopy: 54.06 ms
-------------------------
Block size: 1024 bytes
Non-overlapped Array.Copy: 44.01 ms
Non-overlapped Buffer.BlockCopy: 33.30 ms
Overlapped Array.Copy: 53.13 ms
Overlapped Buffer.BlockCopy: 51.36 ms
-------------------------
Block size: 2048 bytes
Non-overlapped Array.Copy: 27.05 ms
Non-overlapped Buffer.BlockCopy: 25.57 ms
Overlapped Array.Copy: 46.86 ms
Overlapped Buffer.BlockCopy: 47.83 ms
-------------------------
Block size: 4096 bytes
Non-overlapped Array.Copy: 29.11 ms
Non-overlapped Buffer.BlockCopy: 25.12 ms
Overlapped Array.Copy: 45.05 ms
Overlapped Buffer.BlockCopy: 47.84 ms
-------------------------
Block size: 8192 bytes
Non-overlapped Array.Copy: 24.95 ms
Non-overlapped Buffer.BlockCopy: 21.52 ms
Overlapped Array.Copy: 43.81 ms
Overlapped Buffer.BlockCopy: 43.22 ms
-------------------------
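You've exhausted the options added in .NET 4.5 that directly affect the jitted code. The next step is to look at the generated machine code to spot any obvious inefficiencies. Do so with the debugger, but first prevent it from disabling the optimizer: Tools + Options, Debugging, General, untick the "Suppress JIT optimization on module load" option. Set a breakpoint on the hot code, then use Debug + Disassembly to take a look.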
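There is not that much to consider; the jitter's optimizer generally does an excellent job. One thing to look for is a failed attempt at eliminating an array bounds check; the fixed keyword is an unsafe workaround for that. A corner case is a failed attempt at inlining a method combined with the jitter not using the CPU registers effectively. That is an issue with the x86 jitter and is fixed with MethodImplOptions.NoInlining. The optimizer isn't terribly good at hoisting invariant code out of a loop, but that is something you would almost always consider first anyway when staring at the C# code looking for ways to optimize it.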
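A minimal sketch of the source-level side of these points, using hypothetical names (LoopDemo, SumAll, ScaleInPlace, ColdHelper). SumAll uses the loop shape for which the jitter can usually prove the index is in range, ScaleInPlace hoists an invariant expression by hand, and ColdHelper shows where the NoInlining attribute goes. The fixed workaround is not shown because it requires compiling with unsafe code enabled. Whether a bounds check is actually removed is an assumption here; it has to be confirmed in the disassembly.

```csharp
using System;
using System.Runtime.CompilerServices;

static class LoopDemo
{
    // The loop bound is data.Length itself, which gives the jitter
    // the best chance of dropping the per-element bounds check.
    public static long SumAll(int[] data)
    {
        long sum = 0;
        for (int i = 0; i < data.Length; ++i)
        {
            sum += data[i];
        }
        return sum;
    }

    // scale * scale does not change inside the loop, so compute it once.
    public static void ScaleInPlace(double[] values, double scale)
    {
        double factor = scale * scale; // hoisted by hand
        for (int i = 0; i < values.Length; ++i)
        {
            values[i] *= factor;
        }
    }

    // Forces a real call instead of an inlined body.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int ColdHelper(int x)
    {
        return x + 1;
    }

    static void Main()
    {
        Console.WriteLine(SumAll(new[] { 1, 2, 3 }));

        double[] values = { 1.0, 2.0 };
        ScaleInPlace(values, 2.0);
        Console.WriteLine("{0} {1}", values[0], values[1]);

        Console.WriteLine(ColdHelper(41));
    }
}
```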
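The most important thing to know is when you are done and can't hope to make it any faster. You can only really get there by comparing apples and oranges and writing the hot code in native code using C++/CLI. Make sure that this code is compiled with #pragma unmanaged in effect so it gets the full optimizer love. There is a cost associated with switching from managed to native code execution, so make sure the execution time of the native code is substantial enough. This is otherwise not necessarily easy to do, and you certainly have no guarantee of success. Still, knowing that you are done can save you a lot of time stumbling into dead ends.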