.net - How is this virtual method call faster than the sealed method call?

Question

I am doing some tinkering with the performance of virtual vs. sealed members.

Below is my test code.

The output is

virtual total 3166ms
per call virtual 3.166ns
sealed total 3931ms
per call sealed 3.931ns

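I must be doing something wrong, because according to this the virtual call is faster than the sealed call.

I am running in Release mode with "Optimize code" turned on.

Edit: When running outside of VS (as a console app), the times are close to a dead heat, but the virtual call almost always comes out in front.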

using System;
using System.Diagnostics;
using NUnit.Framework;

[TestFixture]
public class VirtTests
{

    public class ClassWithNonEmptyMethods
    {
        private double x;
        private double y;

        public virtual void VirtualMethod()
        {
            x++;
        }
        public void SealedMethod()
        {
            y++;
        }
    }

    const int iterations = 1000000000;


    [Test]
    public void NonEmptyMethodTest()
    {

        var foo = new ClassWithNonEmptyMethods();
        //Pre-call
        foo.VirtualMethod();
        foo.SealedMethod();

        var virtualWatch = new Stopwatch();
        virtualWatch.Start();
        for (var i = 0; i < iterations; i++)
        {
            foo.VirtualMethod();
        }
        virtualWatch.Stop();
        Console.WriteLine("virtual total {0}ms", virtualWatch.ElapsedMilliseconds);
        Console.WriteLine("per call virtual {0}ns", ((float)virtualWatch.ElapsedMilliseconds * 1000000) / iterations);


        var sealedWatch = new Stopwatch();
        sealedWatch.Start();
        for (var i = 0; i < iterations; i++)
        {
            foo.SealedMethod();
        }
        sealedWatch.Stop();
        Console.WriteLine("sealed total {0}ms", sealedWatch.ElapsedMilliseconds);
        Console.WriteLine("per call sealed {0}ns", ((float)sealedWatch.ElapsedMilliseconds * 1000000) / iterations);

    }

}

score 4 · Accepted Answer

You are testing the effects of memory alignment on code efficiency. The 32-bit JIT compiler has trouble generating efficient code for value types that are more than 32 bits in size, long and double in C# code. The root of the problem is the 32-bit GC heap allocator: it only promises alignment of allocated memory on addresses that are a multiple of 4. That's an issue here, since you are incrementing doubles. A double is efficient only when it is aligned on an address that's a multiple of 8. The stack has the same issue for local variables: it is also aligned only to 4 on a 32-bit machine.

The L1 CPU cache is internally organized in blocks called "cache lines". There is a penalty when the program reads a misaligned double, especially one that straddles the end of a cache line: bytes from two cache lines have to be read and glued together. Misalignment isn't uncommon with the 32-bit jitter; the odds are merely 50-50 that the 'x' field happens to be allocated on an address that's a multiple of 8. If it isn't, then 'x' and 'y' are going to be misaligned and one of them may well straddle a cache line. The way you wrote the test, that's going to make either VirtualMethod or SealedMethod slower. Make sure you let them use the same field to get comparable results.

The same is true for code. Swap the code for the virtual and sealed test to arbitrarily change the outcome. I had no trouble making the sealed test quite a bit faster that way. Given the modest difference in speed, you are probably looking at a code alignment issue. The x64 jitter makes an effort to insert NOPs to get a branch target aligned, the x86 jitter doesn't.

You should also run the timing test several times in a loop, at least 20. You are then also likely to observe the effect of the garbage collector moving the class object. The double may have a different alignment afterward, dramatically changing the timing. Accessing a 64-bit value type like long or double has 3 distinct timings: aligned on 8, aligned on 4 within a cache line, and aligned on 4 across two cache lines, in fast to slow order.

The penalty is steep: reading a double that straddles a cache line is roughly three times slower than reading an aligned one. It is also the core reason why a double[] (array of doubles) is allocated in the Large Object Heap even when it has only 1000 elements, well south of the normal threshold of 80KB: the LOH has an alignment guarantee of 8. These alignment problems entirely disappear in code generated by the x64 jitter, where both the stack and the GC heap have an alignment of 8.
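A minimal sketch of the two suggestions above: both methods increment the same field, and the whole measurement is repeated 20 times so that run-to-run variation becomes visible. The names Counter, RepeatedTiming, TimeVirtual and TimeNonVirtual are made up for this illustration, and the iteration count is an arbitrary choice.

```csharp
using System;
using System.Diagnostics;

public class Counter
{
    // One shared field, so both methods touch the same address
    // and alignment affects them equally.
    private double x;

    public virtual void VirtualMethod() { x++; }
    public void NonVirtualMethod() { x++; }
}

public static class RepeatedTiming
{
    // Returns the elapsed milliseconds for `iterations` virtual calls.
    public static long TimeVirtual(Counter foo, int iterations)
    {
        var sw = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++) foo.VirtualMethod();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    // Returns the elapsed milliseconds for `iterations` non-virtual calls.
    public static long TimeNonVirtual(Counter foo, int iterations)
    {
        var sw = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++) foo.NonVirtualMethod();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        // Shows whether the x86 or the x64 jitter is in play.
        Console.WriteLine("64-bit process: {0}", Environment.Is64BitProcess);

        var foo = new Counter();
        // Warm-up so both methods are JIT-compiled before timing.
        TimeVirtual(foo, 1000);
        TimeNonVirtual(foo, 1000);

        const int iterations = 100000000; // assumed value; adjust as needed
        for (var round = 0; round < 20; round++)
        {
            long v = TimeVirtual(foo, iterations);
            long n = TimeNonVirtual(foo, iterations);
            Console.WriteLine("round {0}: virtual {1}ms, non-virtual {2}ms", round, v, n);
        }
    }
}
```

If the timings jump between rounds in a 32-bit process but stay flat in a 64-bit one, that would be consistent with the alignment explanation, though it does not prove it.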

score 1 · Accepted Answer

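You may be seeing some startup cost. Try wrapping the Test-A/Test-B code in a loop and running it several times. You may also be seeing some kind of ordering effect. To avoid that (and top/bottom-of-loop effects), unroll the calls 2-3 times.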
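One way this advice could look in code, as a rough sketch: the loop bodies are unrolled four times, and the order of the two tests alternates between rounds so that neither always runs first. Target, OrderingBench, TimeA and TimeB are hypothetical names, and the unroll factor and round count are arbitrary.

```csharp
using System;
using System.Diagnostics;

public class Target
{
    private double x;
    public virtual void A() { x++; } // virtual call under test
    public void B() { x++; }         // non-virtual call under test
}

public static class OrderingBench
{
    // `iterations` is assumed to be a multiple of 4 (the unroll factor).
    public static long TimeA(Target t, int iterations)
    {
        var sw = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i += 4)
        {
            t.A(); t.A(); t.A(); t.A();
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static long TimeB(Target t, int iterations)
    {
        var sw = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i += 4)
        {
            t.B(); t.B(); t.B(); t.B();
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        var t = new Target();
        const int iterations = 100000000; // multiple of 4
        long totalA = 0, totalB = 0;

        for (var round = 0; round < 10; round++)
        {
            // Alternate which test runs first to even out ordering effects.
            if (round % 2 == 0)
            {
                totalA += TimeA(t, iterations);
                totalB += TimeB(t, iterations);
            }
            else
            {
                totalB += TimeB(t, iterations);
                totalA += TimeA(t, iterations);
            }
        }

        Console.WriteLine("A total {0}ms, B total {1}ms", totalA, totalB);
    }
}
```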

score 1 · Accepted Answer

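~~First, you have to mark the method sealed.~~

Secondly, provide an override of the virtual method. Create an instance of the derived class.

As a third test, create a sealed override method.

Now you can start comparing.

Edit: You should probably run this outside VS.

Update:

Example of what I mean.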

abstract class Foo
{
  // virtual members cannot be private in C#, so Bar is public
  public virtual void Bar() {}
}

class Baz : Foo
{
  public sealed override void Bar() {}
}

class Woz : Foo
{
  public override void Bar() {}
}

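Now test the call speed of Bar on an instance of both Baz and Woz. I also suspect that member and class visibility outside the assembly could affect JIT analysis.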
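A sketch of how that comparison might be timed, assuming public members and a method body that increments a shared field so the calls do some work. SealedOverrideBench, TimeBaz and TimeWoz are illustrative names that are not part of the answer.

```csharp
using System;
using System.Diagnostics;

public abstract class Foo
{
    protected double x;
    public virtual void Bar() { x++; }
}

public class Baz : Foo
{
    public sealed override void Bar() { x++; }
}

public class Woz : Foo
{
    public override void Bar() { x++; }
}

public static class SealedOverrideBench
{
    // The parameter is typed as Baz, so the sealed override is visible at the call site.
    public static long TimeBaz(Baz baz, int iterations)
    {
        var sw = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++) baz.Bar();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    // Woz.Bar is an ordinary override that a further subclass could override again.
    public static long TimeWoz(Woz woz, int iterations)
    {
        var sw = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++) woz.Bar();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        var baz = new Baz();
        var woz = new Woz();

        // Warm-up so both overrides are JIT-compiled before timing.
        TimeBaz(baz, 1000);
        TimeWoz(woz, 1000);

        const int iterations = 100000000; // assumed value; adjust as needed
        Console.WriteLine("sealed override total {0}ms", TimeBaz(baz, iterations));
        Console.WriteLine("plain override total {0}ms", TimeWoz(woz, iterations));
    }
}
```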

score 0 · Accepted Answer

Using the following code as the reference for our test, let's analyze the Microsoft intermediate language (MSIL) generated by the compiler with the Ildasm.exe (IL Disassembler) tool.

public sealed class Sealed
{
    public string Message { get; set; }
    public void DoStuff() { }
}
public class Derived : Base
{
    public sealed override void DoStuff() { }
}
public class Base
{
    public string Message { get; set; }
    public virtual void DoStuff() { }
}
static void Main()
{
    Sealed sealedClass = new Sealed();
    sealedClass.DoStuff();
    Derived derivedClass = new Derived();
    derivedClass.DoStuff();
    Base BaseClass = new Base();
    BaseClass.DoStuff();
}

To run this tool, open the Developer Command Prompt for Visual Studio and execute the command ildasm.

**********************************************************************
** Visual Studio 2017 Developer Command Prompt v15.9.13
** Copyright (c) 2017 Microsoft Corporation
**********************************************************************


C:\Program Files (x86)\Microsoft Visual Studio\2017\Community>ildasm

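Once the application has started, load the executable (or assembly) of the previous application.

Double-click the Main method to view the Microsoft intermediate language (MSIL) information.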

.method private hidebysig static void  Main() cil managed
{
  .entrypoint
  // Code size       41 (0x29)
  .maxstack  8
  IL_0000:  newobj     instance void ConsoleApp1.Program/Sealed::.ctor()
  IL_0005:  callvirt   instance void ConsoleApp1.Program/Sealed::DoStuff()
  IL_000a:  newobj     instance void ConsoleApp1.Program/Derived::.ctor()
  IL_000f:  callvirt   instance void ConsoleApp1.Program/Base::DoStuff()
  IL_0014:  newobj     instance void ConsoleApp1.Program/Base::.ctor()
  IL_0019:  callvirt   instance void ConsoleApp1.Program/Base::DoStuff()
  IL_0028:  ret
} // end of method Program::Main

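As you can see, each class uses newobj to create a new instance, pushing an object reference onto the stack, and callvirt to make a late-bound call to the DoStuff() method of its respective object.

Judging by this information, the compiler seems to handle sealed, derived, and base classes in the same way. To be sure, let's dig deeper by analyzing the JIT-compiled code with the Disassembly window in Visual Studio.

Enable disassembly by selecting Enable address-level debugging under Tools > Options > Debugging > General.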

Set a breakpoint at the beginning of the application and start debugging. Once the application hits the breakpoint, open the Disassembly window by selecting Debug > Windows > Disassembly.

--- C:\Users\Ivan Porta\source\repos\ConsoleApp1\Program.cs --------------------
        {
0066084A  in          al,dx  
0066084B  push        edi  
0066084C  push        esi  
0066084D  push        ebx  
0066084E  sub         esp,4Ch  
00660851  lea         edi,[ebp-58h]  
00660854  mov         ecx,13h  
00660859  xor         eax,eax  
0066085B  rep stos    dword ptr es:[edi]  
0066085D  cmp         dword ptr ds:[5842F0h],0  
00660864  je          0066086B  
00660866  call        744CFAD0  
0066086B  xor         edx,edx  
0066086D  mov         dword ptr [ebp-3Ch],edx  
00660870  xor         edx,edx  
00660872  mov         dword ptr [ebp-48h],edx  
00660875  xor         edx,edx  
00660877  mov         dword ptr [ebp-44h],edx  
0066087A  xor         edx,edx  
0066087C  mov         dword ptr [ebp-40h],edx  
0066087F  nop  
            Sealed sealedClass = new Sealed();
00660880  mov         ecx,584E1Ch  
00660885  call        005730F4  
0066088A  mov         dword ptr [ebp-4Ch],eax  
0066088D  mov         ecx,dword ptr [ebp-4Ch]  
00660890  call        00660468  
00660895  mov         eax,dword ptr [ebp-4Ch]  
00660898  mov         dword ptr [ebp-3Ch],eax  
            sealedClass.DoStuff();
0066089B  mov         ecx,dword ptr [ebp-3Ch]  
0066089E  cmp         dword ptr [ecx],ecx  
006608A0  call        00660460  
006608A5  nop  
            Derived derivedClass = new Derived();
006608A6  mov         ecx,584F3Ch  
006608AB  call        005730F4  
006608B0  mov         dword ptr [ebp-50h],eax  
006608B3  mov         ecx,dword ptr [ebp-50h]  
006608B6  call        006604A8  
006608BB  mov         eax,dword ptr [ebp-50h]  
006608BE  mov         dword ptr [ebp-40h],eax  
            derivedClass.DoStuff();
006608C1  mov         ecx,dword ptr [ebp-40h]  
006608C4  mov         eax,dword ptr [ecx]  
006608C6  mov         eax,dword ptr [eax+28h]  
006608C9  call        dword ptr [eax+10h]  
006608CC  nop  
            Base BaseClass = new Base();
006608CD  mov         ecx,584EC0h  
006608D2  call        005730F4  
006608D7  mov         dword ptr [ebp-54h],eax  
006608DA  mov         ecx,dword ptr [ebp-54h]  
006608DD  call        00660490  
006608E2  mov         eax,dword ptr [ebp-54h]  
006608E5  mov         dword ptr [ebp-44h],eax  
            BaseClass.DoStuff();
006608E8  mov         ecx,dword ptr [ebp-44h]  
006608EB  mov         eax,dword ptr [ecx]  
006608ED  mov         eax,dword ptr [eax+28h]  
006608F0  call        dword ptr [eax+10h]  
006608F3  nop  
        }
0066091A  nop  
0066091B  lea         esp,[ebp-0Ch]  
0066091E  pop         ebx  
0066091F  pop         esi  
00660920  pop         edi  
00660921  pop         ebp  

00660922  ret

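As we can see in the preceding code, object creation is the same in all three cases, but the instructions executed to invoke the method differ slightly between the sealed class and the derived/base classes. For the sealed class, after the object reference is moved into a register (mov instruction), a comparison between dword ptr [ecx] and ecx (cmp instruction) is executed before the method is called at a fixed address; the comparison dereferences the object reference and so acts as a null check. For the derived and base classes, two additional mov instructions load the target from the object's method table, and the call goes through that pointer (call dword ptr [eax+10h]).

According to the report by Torbjörn Granlund, Instruction latencies and throughput for AMD and Intel x86 processors, the figures for these instructions on the Intel Pentium 4 are:

mov: a latency of 1 cycle; the processor can sustain 2.5 instructions of this type per cycle
cmp: a latency of 1 cycle; the processor can sustain 2 instructions of this type per cycle

In conclusion, the optimizations in today's compilers and processors make the performance difference between sealed and non-sealed classes so small that it is irrelevant for most applications.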

References

Newobj: https://docs.microsoft.com/en-us/dotnet/api/system.reflection.emit.opcodes.newobj?view=netframework-4.8
Callvirt: https://docs.microsoft.com/en-us/dotnet/api/system.reflection.emit.opcodes.callvirt?view=netframework-4.8
Disassembly: https://docs.microsoft.com/en-us/visualstudio/debugger/how-to-use-the-disassembly-window?view=vs-2019
x86 instructions: https://www.aldeid.com/wiki/X86-assembly/Instructions
Instruction latencies and throughput for AMD and Intel x86 processors: https://gmplib.org/~tege/x86-timing.pdf
