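I have been unable to understand the performance characteristics of using Func<...> throughout my code when inheritance and generics are involved, and this is a combination I find myself using all the time.

Let me start with a minimal test case so we all know what we are talking about. Then I will post the results, and after that explain what I expected and why.

Minimal test case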

using System;

public class GenericsTest2 : GenericsTest<int>
{
    static void Main(string[] args)
    {
        GenericsTest2 at = new GenericsTest2();

        at.test(at.func);
        at.test(at.Check);
        at.test(at.func2);
        at.test(at.Check2);
        at.test((a) => a.Equals(default(int)));
        Console.ReadLine();
    }

    public GenericsTest2()
    {
        func = func2 = (a) => Check(a);
    }

    protected Func<int, bool> func2;

    public bool Check2(int value)
    {
        return value.Equals(default(int));
    }

    public void test(Func<int, bool> func)
    {
        using (Stopwatch sw = new Stopwatch((ts) => { Console.WriteLine("Took {0:0.00}s", ts.TotalSeconds); }))
        {
            for (int i = 0; i < 100000000; ++i)
            {
                func(i);
            }
        }
    }
}

public class GenericsTest<T>
{
    public bool Check(T value)
    {
        return value.Equals(default(T));
    }

    protected Func<T, bool> func;
}

public class Stopwatch : IDisposable
{
    public Stopwatch(Action<TimeSpan> act)
    {
        this.act = act;
        this.start = DateTime.UtcNow;
    }

    private Action<TimeSpan> act;
    private DateTime start;

    public void Dispose()
    {
        act(DateTime.UtcNow.Subtract(start));
    }
}

Results

Took 2.50s  -> at.test(at.func);
Took 1.97s  -> at.test(at.Check);
Took 2.48s  -> at.test(at.func2);
Took 0.72s  -> at.test(at.Check2);
Took 0.81s  -> at.test((a) => a.Equals(default(int)));

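What I expected and why

I would expect this code to run at exactly the same speed for all 5 methods. More precisely, I would expect each of them to be even faster than what was measured above, namely just as fast as: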

using (Stopwatch sw = new Stopwatch((ts) => { Console.WriteLine("Took {0:0.00}s", ts.TotalSeconds); }))
{
    for (int i = 0; i < 100000000; ++i)
    {
        bool b = i.Equals(default(int));
    }
}
// this takes 0.32s ?!?

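I expected it to take 0.32s because I see no reason for the JIT compiler not to inline the code in this particular case.

On closer inspection, I do not understand these performance numbers at all:

  • at.func is passed to the function and cannot change during execution. Why is it not inlined?
  • at.Check is apparently slower than at.Check2, even though neither can be overridden and the IL of at.Check, in the case of class GenericsTest2, is as fixed as a rock.
  • I see no reason why Func<int, bool> should be slower when an inline Func is passed instead of a method that is converted to a Func.
  • Why is the difference between test cases 2 and 3 a whopping 0.5s, while the difference between cases 4 and 5 is 0.1s? Shouldn't they be the same?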

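Question

I would really like to understand this. What is happening here that makes using a generic base class up to 10x slower than inlining the whole lot?

So, basically the question is: why is this happening, and how can I fix it?

Update

Based on all the comments so far (thanks!) I did some more digging.

First, here is a new set of results from repeating the tests with the loop made 5x larger and the whole suite executed 4 times. I used the Diagnostics Stopwatch and added more tests (with descriptions as well).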

(Baseline implementation took 2.61s)

--- Run 0 ---
Took 3.00s for (a) => at.Check2(a)
Took 12.04s for Check3<int>
Took 12.51s for (a) => GenericsTest2.Check(a)
Took 13.74s for at.func
Took 16.07s for GenericsTest2.Check
Took 12.99s for at.func2
Took 1.47s for at.Check2
Took 2.31s for (a) => a.Equals(default(int))
--- Run 1 ---
Took 3.18s for (a) => at.Check2(a)
Took 13.29s for Check3<int>
Took 14.10s for (a) => GenericsTest2.Check(a)
Took 13.54s for at.func
Took 13.48s for GenericsTest2.Check
Took 13.89s for at.func2
Took 1.94s for at.Check2
Took 2.61s for (a) => a.Equals(default(int))
--- Run 2 ---
Took 3.18s for (a) => at.Check2(a)
Took 12.91s for Check3<int>
Took 15.20s for (a) => GenericsTest2.Check(a)
Took 12.90s for at.func
Took 13.79s for GenericsTest2.Check
Took 14.52s for at.func2
Took 2.02s for at.Check2
Took 2.67s for (a) => a.Equals(default(int))
--- Run 3 ---
Took 3.17s for (a) => at.Check2(a)
Took 12.69s for Check3<int>
Took 13.58s for (a) => GenericsTest2.Check(a)
Took 14.27s for at.func
Took 12.82s for GenericsTest2.Check
Took 14.03s for at.func2
Took 1.32s for at.Check2
Took 1.70s for (a) => a.Equals(default(int))

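What I noticed from these results is that the moment you start using generics, everything gets much slower. Digging a bit deeper, this is the IL I found for the non-generic implementation: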

L_0000: ldarga.s 'value'
L_0002: ldc.i4.0 
L_0003: call instance bool [mscorlib]System.Int32::Equals(int32)
L_0008: ret 

And for all the generic implementations:

L_0000: ldarga.s 'value'
L_0002: ldloca.s CS$0$0000
L_0004: initobj !T
L_000a: ldloc.0 
L_000b: box !T
L_0010: constrained. !T
L_0016: callvirt instance bool [mscorlib]System.Object::Equals(object)
L_001b: ret 

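While most of this can be optimized, I think the callvirt may be the problem here.

In an attempt to make it faster, I added the 'T : IEquatable<T>' constraint to the definition of the method. The result is: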

L_0011: callvirt instance bool [mscorlib]System.IEquatable`1<!T>::Equals(!0)

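While I now understand more about the performance (it probably cannot be inlined because it creates a vtable lookup), I am still confused: why does it not simply call T::Equals? After all, I did specify that it will be there...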

2 Answers

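Always run a micro-benchmark 3 times. The first run triggers the JIT, so discard it to rule that effect out. Then check whether the 2nd and 3rd runs are equal. This gives: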

... run ...
Took 0.79s
Took 0.63s
Took 0.74s
Took 0.24s
Took 0.32s
... run ...
Took 0.73s
Took 0.63s
Took 0.73s
Took 0.24s
Took 0.33s
... run ...
Took 0.74s
Took 0.63s
Took 0.74s
Took 0.25s
Took 0.33s
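
The run-it-three-times advice above can be sketched as a small harness. This is only an illustration under assumptions: BenchHarness and Measure are hypothetical names that do not appear in the original post, and System.Diagnostics.Stopwatch is used in place of the question's custom Stopwatch class.

```csharp
using System;
using System.Diagnostics;

public static class BenchHarness
{
    // Hypothetical helper (not from the original post): runs `body` `runs`
    // times and returns the elapsed time of each run separately.
    public static TimeSpan[] Measure(Action body, int runs)
    {
        var timings = new TimeSpan[runs];
        for (int r = 0; r < runs; r++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            body();
            sw.Stop();
            timings[r] = sw.Elapsed;
        }
        return timings;
    }

    public static void Main()
    {
        Func<int, bool> check = a => a.Equals(default(int));

        TimeSpan[] timings = Measure(() =>
        {
            for (int i = 0; i < 100000000; ++i)
            {
                check(i);
            }
        }, 3);

        // Run 0 includes JIT compilation, so compare runs 1 and 2 instead.
        for (int r = 0; r < timings.Length; r++)
        {
            Console.WriteLine("Run {0}: {1:0.00}s{2}", r, timings[r].TotalSeconds,
                r == 0 ? " (warm-up, discard)" : "");
        }
    }
}
```

If runs 1 and 2 still differ noticeably, the measurement is probably too noisy to draw conclusions from.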

The line

func = func2 = (a) => Check(a);

adds an additional function call. Removing it by writing

func = func2 = this.Check;

gives:

... 1. run ...
Took 0.64s
Took 0.63s
Took 0.63s
Took 0.24s
Took 0.32s
... 2. run ...
Took 0.63s
Took 0.63s
Took 0.63s
Took 0.24s
Took 0.32s
... 3. run ...
Took 0.63s
Took 0.63s
Took 0.63s
Took 0.24s
Took 0.32s

This shows that the (JIT?) effect between the 1st and 2nd runs disappeared once the extra function call was removed. The first 3 tests are now equal.
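
The extra hop that the lambda wrapper introduces can be made visible by inspecting the delegate's target method. The sketch below is an illustration only: Checker<T> and DelegateTargets are hypothetical names standing in for the classes in the question.

```csharp
using System;

// Hypothetical stand-in for GenericsTest<T> from the question.
public class Checker<T>
{
    public bool Check(T value)
    {
        return value.Equals(default(T));
    }
}

public static class DelegateTargets
{
    // Method-group conversion: the delegate points straight at Check.
    public static Func<int, bool> Direct(Checker<int> c)
    {
        return c.Check;
    }

    // Lambda wrapper: the delegate points at a compiler-generated method,
    // which in turn calls Check (one extra call per invocation).
    public static Func<int, bool> Wrapped(Checker<int> c)
    {
        return a => c.Check(a);
    }

    public static void Main()
    {
        var c = new Checker<int>();
        Console.WriteLine(Direct(c).Method.Name);             // Check
        Console.WriteLine(Wrapped(c).Method.Name == "Check"); // False
    }
}
```

Both delegates return the same results. They differ only in how many calls sit between the delegate invocation and Check.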

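In tests 4 and 5, the compiler can inline the function argument into void test(Func<>), while in tests 1 to 3 it would take the compiler a long way to figure out that the arguments are constant. Sometimes the compiler is subject to constraints that are not easy to see from our perspective as coders, such as the .NET and JIT constraints that come from the dynamic nature of .NET programs compared with a binary produced from C++. Either way, it is the inlining of the function argument that makes the difference here.

The difference between 4 and 5? Well, test 5 looks like something the compiler could also inline very easily. Maybe it builds a context for the closure, and resolving that is a bit more complex than it needs to be. I did not dig into the MSIL to find out.

The tests above were run with .NET 4.5. Here are the results with 3.5, which show that the compiler has become better at inlining: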

... 1. run ...
Took 1.06s
Took 1.06s
Took 1.06s
Took 0.24s
Took 0.27s
... 2. run ...
Took 1.06s
Took 1.08s
Took 1.06s
Took 0.25s
Took 0.27s
... 3. run ...
Took 1.05s
Took 1.06s
Took 1.05s
Took 0.24s
Took 0.27s

And .NET 4:

... 1. run ...
Took 0.97s
Took 0.97s
Took 0.96s
Took 0.22s
Took 0.30s
... 2. run ...
Took 0.96s
Took 0.96s
Took 0.96s
Took 0.22s
Took 0.30s
... 3. run ...
Took 0.97s
Took 0.96s
Took 0.96s
Took 0.22s
Took 0.30s

And now with GenericsTest<> changed to a non-generic GenericsTest:

... 1. run ...
Took 0.28s
Took 0.24s
Took 0.24s
Took 0.24s
Took 0.27s
... 2. run ...
Took 0.24s
Took 0.24s
Took 0.24s
Took 0.24s
Took 0.27s
... 3. run ...
Took 0.25s
Took 0.25s
Took 0.25s
Took 0.24s
Took 0.27s

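Well, this is a surprise from the C# compiler, similar to what I have encountered with sealing classes to avoid virtual function calls. Maybe Eric Lippert has a word on that?

Replacing the inheritance with aggregation brings the performance back. I have learned never to use inheritance, or at least only very rarely, and I strongly recommend avoiding it at least in this case. (This is my pragmatic solution to this problem; no flame wars intended.) I use interfaces all the way, and they carry no performance penalties.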
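
The answer does not show what its aggregation-based version looked like, so the following is only one possible shape, given as an assumption. IntChecker and ComposedTest are hypothetical names: the test class holds a non-generic checker instead of deriving from GenericsTest<int>. No timing is claimed for this sketch.

```csharp
using System;

// Hypothetical non-generic helper that takes the place of the generic base class.
public sealed class IntChecker
{
    public bool Check(int value)
    {
        return value.Equals(default(int));
    }
}

// Holds a checker (aggregation) instead of inheriting from GenericsTest<int>.
public class ComposedTest
{
    private readonly IntChecker checker = new IntChecker();

    // Hands out a delegate bound directly to the aggregated checker.
    public Func<int, bool> CreateCheck()
    {
        return checker.Check;
    }

    public static void Main()
    {
        Func<int, bool> f = new ComposedTest().CreateCheck();
        Console.WriteLine(f(0)); // True
        Console.WriteLine(f(5)); // False
    }
}
```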

Answered 2013-03-28T10:55:11.937

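I am going to explain what I think is going on here and with generics in general. I needed some space to write, so I am posting this as an answer. Thank you all for commenting and helping to figure this out; I will make sure to award points here and there.

To get started...

Compiling generics

As we all know, generics are 'template' types for which the type information is filled in at run time. Assumptions can be made based on constraints, but the IL code itself does not change... (more on this later).

A method from my question: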

public class Foo<T>
{
    public bool Handle(T foo)
    {
        return foo.Equals(default(T));
    }
}

The only constraint on T here is that it is an Object, which means the call to Equals goes to Object.Equals. Since T implements Object.Equals, the call looks like this:

L_0016: callvirt instance bool [mscorlib]System.Object::Equals(object)

We can improve on this by making it explicit that T implements Equals, which is done by adding the constraint T : IEquatable<T>. This changes the call to:

L_0011: callvirt instance bool [mscorlib]System.IEquatable`1<!T>::Equals(!0)

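However, since T has not been filled in yet, IL apparently does not support calling T::Equals(!0) directly, even though the method is certain to be there. The compiler can apparently only assume that the constraint has been fulfilled, so it has to issue the call against IEquatable`1, the type that defines the method.

Apparently, hints like sealed make no difference, even though they should.

Conclusion: because T::Equals(!0) is not supported, a vtable lookup is required to make it work. Once the call has become a callvirt, it is very hard for the JIT compiler to figure out that it should just use a call.

What should happen: basically, Microsoft should support T::Equals(!0) when the method clearly exists. That would turn the call into a normal call in IL, making it much faster.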

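But it gets worse

So what about calling Foo<T>::Handle?

What surprised me is that the call to Foo<T>::Handle is also a callvirt and not a call. The same behavior can be found for, for example, List<T>::Add. My observation was that only calls made through this become a normal call; everything else compiles to a callvirt.

Conclusion: the behavior is as if you had a class structure like Foo<int>:Foo<T>:[the rest], which does not really make sense. Apparently, all calls to a generic class from outside that class compile to a vtable lookup.

What should happen: Microsoft should change the callvirt to a call if the method is non-virtual. There is really no reason at all for the callvirt.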

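Conclusion

If you use generics from another type, be prepared to get a callvirt instead of a call, even when that is not necessary. The resulting performance is basically what you can expect from such a call...

IMHO this is a real shame. Type safety should help developers and at the same time make your code faster, because the compiler can make assumptions about what is going on. The lesson I learned from all this is: do not use generics unless you do not care about the extra vtable lookups (until Microsoft fixes this).

Future work

First off, I am going to post this on Microsoft Connect. I think this is a serious bug in .NET that drains performance without any good reason. (https://connect.microsoft.com/VisualStudio/feedback/details/782346/using-generics-will-always-compile-to-callvirt-even-if-this-is-not-necessary)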


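Results from Microsoft Connect

Yes, we have a result, with my thanks to Mike Danes!

The method call foo.Equals(default(T)) compiles to Object.Equals(boxed[new !0]) because the only Equals that all T have in common is Object.Equals. This causes a boxing operation and a vtable lookup.

If we want the call to use the correct Equals, we have to give the compiler a hint that the type implements bool Equals(T). This can be done by telling the compiler that the type T implements IEquatable<T>.

In other words, change the signature of the class as follows: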

public class GenericsTest<T> where T:IEquatable<T>
{
    public bool Check(T value)
    {
        return value.Equals(default(T));
    }

    protected Func<T, bool> func;
}

When you do it like this, the runtime will find the correct Equals method. Phew...
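
The effect of the constraint can be checked without reading IL by counting which Equals overload is reached. This is a minimal sketch under assumptions: Probe and EqualsDispatch are hypothetical names introduced for illustration and are not part of the original post.

```csharp
using System;

// Hypothetical probe type that counts which Equals overload gets called.
public struct Probe : IEquatable<Probe>
{
    public static int ObjectCalls;
    public static int TypedCalls;

    public int Value;

    public Probe(int value)
    {
        Value = value;
    }

    public override bool Equals(object obj)
    {
        ObjectCalls++;
        return obj is Probe && ((Probe)obj).Value == Value;
    }

    public bool Equals(Probe other)
    {
        TypedCalls++;
        return other.Value == Value;
    }

    public override int GetHashCode()
    {
        return Value;
    }
}

public static class EqualsDispatch
{
    // No constraint: the call binds to Object.Equals(object), so default(T) is boxed.
    public static bool CheckUnconstrained<T>(T value)
    {
        return value.Equals(default(T));
    }

    // With the constraint: the call binds to IEquatable<T>.Equals(T), no boxing.
    public static bool CheckConstrained<T>(T value) where T : IEquatable<T>
    {
        return value.Equals(default(T));
    }

    public static void Main()
    {
        CheckUnconstrained(new Probe(0));
        CheckConstrained(new Probe(0));
        // prints "object: 1, typed: 1"
        Console.WriteLine("object: {0}, typed: {1}", Probe.ObjectCalls, Probe.TypedCalls);
    }
}
```

The sketch only shows which overload the compiler binds to. Whether the runtime then inlines the call is a separate matter that it does not measure.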

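To solve the puzzle completely, one more element is required: .NET 4.5. The .NET 4.5 runtime is able to inline this method, which makes it as fast as it should be again. In .NET 4.0 (which is what I am currently using), this functionality does not appear to be present. The call is still a callvirt in IL, but the runtime solves the puzzle regardless.

If you test this code, it should be just as fast as the fastest test cases. Can someone please confirm this?

Answered 2013-03-28T12:55:29.213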