c++ - c++ 中的高性能代码（继承、函数指针、if）

Question

假设你有一个非常大的图，它的节点上有很多处理（比如每个节点有数千万次操作）。每个节点的核心例程都是相同的，但是根据内部条件还有一些额外的操作。可能有 2 个这样的条件产生 4 个案例 (0,0), (1,0), (0,1), (1,1)。例如 (1,1) 意味着两个条件都成立。条件在程序中建立一次（每个节点独立设置一组），从那时起，永远不会改变。不幸的是，它们是在运行时以完全不可预测的方式确定的（基于通过 HTTP 从外部服务器接收的数据）。

在这种情况下最快的是什么？（考虑到我不知道的现代编译器优化）

只需使用“IFs”：如果（条件 X）执行附加操作 X。
使用继承从基类派生四个类（暴露方法OPERATION）以进行正确的操作并节省数以千万计的“ifs”。[但我不确定这是否是真正的节省，继承也必须有它的开销）
使用指向函数的指针根据条件分配函数一次。

我需要很长时间才能达到可以自己测试的地步（我还没有这么大的数据，这将被集成到更大的项目中，因此测试所有版本并不容易）。

阅读答案：我知道我可能必须尝试一下。但除此之外，这是一个更快的问题：

数以百万计的 IF 语句和正常的静态已知函数调用 VS 函数指针调用 VS 继承，我认为在这种情况下这不是最好的主意，我正在考虑从进一步检查中消除它感谢任何建设性的答案（不是说我应该不关心这些小事；）

score 3 · Accepted Answer

除了测量真实数据上的实际代码外，没有真正的答案。有时，在过去，我不得不处理这样的问题，在我实际测量的情况下，虚函数比 if 更快。但这并不意味着什么，因为我测量的案例与你的程序不同（因此也有不同的背景）。例如，虚函数调用通常会阻止内联，而 if 本质上是内联的，内联可能会带来额外的优化可能性。

我测量的机器也很好地处理了虚拟功能；我听说其他一些机器（例如 HP 的 PA）在实现间接跳转时非常无效（不仅包括虚函数调用，还包括函数的返回 --- 再次，失去了内联成本的机会）。

score 2 · Accepted Answer

If you absolutely have to have the fastest way, and the process order of the nodes is not relevant, make four different types, one for each case, and define a process function for each. Then in a container class, have four vectors, one for each node type. Upon creation of a new node get all the data you need to create the node, including the conditions, and create a node of the correct type and push it into the correct vector. Then, when you need to process all your nodes, process them type by type, first processing all nodes in the first vector, then the second, etc.

Why would you want to do this:

No ifs for the state switching
No vtables
No function indirection

But much more importantly:

No instruction cache thrashing (you're not jumping to a different part of your code for every next node)
No branch prediction misses for state switching ifs (since there are none)

Even if you'd have inheritance with virtual functions and thus function indirection through vtables, simply sorting the nodes by their type in your vector may already make a world of difference in performance as any possible instruction cache thrashing would essentially be gone and depending on the methods of branch prediction the branch prediction misses could also be reduced.

Also, don't make a vector of pointers, but make a vector of objects. If they are pointers you have an extra adressing indirection, which in itself is not that worrisome, but the problem is that it may lead to data cache thrashing if the objects are pretty much randomly spread throughout your memory. If on the other hand your objects are directly put into the vector the processing will basically go through memory linearly and the cache will basically hit every time and cache prefetching might actually be able to do a good job.

Note though that you would pay heavily in data structure creation if you don't do it correctly, if at all possible, when making the vector reserve enough capacity in it immediately for all your nodes, reallocating and moving every time your vector runs out of space can become expensive.

Oh, and yes, as James mentioned, always, always measure! What you think may be the fastest way may not be, sometimes things are very counter intuitive, depending on all kinds of factors like optimizations, pipelining, branch prediction, cache hits/misses, data structure layout, etc. What I wrote above is a pretty general approach, but it is not guaranteed to be the fastest and there are definitely ways to do it wrong. Measure, Measure, Measure.

P.S. inheritance with virtual functions is roughly equivalent to using function pointers. Virtual functions are usually implemented by a vtable at the head of the class, which is basically just a table of function pointers to the implementation of the given virtual for the actual type of the object. Whether ifs is faster than virtuals or the other way around is a very very difficult question to answer and depends completely on the implementation, compiler and platform used.

score 1 · Accepted Answer

实际上，我对分支预测的有效性印象深刻，并且只有if解决方案允许内联，这也可能是戏剧性的。虚函数和指向函数的指针也涉及从内存加载，并可能导致缓存未命中

但是，您有四个条件，因此分支未命中可能会很昂贵。

没有能力测试和验证答案真的无法回答。特别是因为它甚至不清楚这将是一个足以保证优化工作的性能瓶颈。

在这种情况下。我会在可读性和易于调试方面犯错，如果

score 1 · Accepted Answer

许多程序员参加过关于某些最喜欢的主题的课程和阅读书籍：流水线、缓存、分支预测、虚函数、编译器优化、big-O 算法等，以及这些主题的性能。

如果我可以对划船进行类比，这些是诸如减轻重量、调整功率、调整平衡和流线型之类的事情，假设您从一艘已经接近最佳状态的快艇开始。

没关系，您实际上可能是从玛丽皇后号出发的，并且您假设它是一艘快艇。

很可能有一些方法可以大大加快代码速度，只要你知道它在哪里，只需通过减少脂肪（伪装成好的设计）。好吧，你不知道它在哪里，猜测在哪里是浪费时间，除非你重视错误。

当人们说“测量并使用分析器”时，他们指的是正确的方向，但还不够远。 这是我如何做的一个例子，我制作了一个粗略的视频，FWIW。

score 0 · Accepted Answer

除非这些属性有明确的模式，否则不存在可以为您有效预测这种数据相关条件的分支预测器。在这种情况下，您最好避免控制推测（并支付分支错误预测的惩罚），并等待实际数据到达并解决控制流（更可能使用虚函数发生）。当然，您必须进行基准测试以验证这一点，因为它取决于实际模式（例如，如果您甚至有小组类似的“标记”元素）。

上面建议的排序很好，但请注意，它将一个简单的 O(n) 问题转换为 O(logn) 问题，所以对于大尺寸，除非你可以排序一次，否则你会丢失 - 遍历多次，或者否则廉价地维护排序状态。

请注意，某些预测器也可能会尝试预测函数调用的地址，因此您可能在那里面临同样的问题。

但是，我必须同意关于早期优化的评论——你确定控制流是你的瓶颈吗？如果从内存中获取实际数据需要更长的时间怎么办？一般来说，您的元素似乎可以并行处理，因此即使您在单个线程上运行它（如果您使用多个内核则更多） - 您应该受带宽限制而不是延迟限制。

c++ - c++ 中的高性能代码（继承、函数指针、if）

5 回答 5

Related

Reference