
CPUs use branch prediction to speed up code, but this only pays off when the predicted branch is actually the one taken.

Why not simply use both branches? That is, assume both branches will be hit, cache both sides, and take the correct one when the condition resolves. The cache then never needs to be invalidated. While this requires the compiler to load both branches beforehand (more memory, proper layout, etc.), I imagine suitable optimization could streamline both branches so that near-optimal results are obtained from a single predictor. In other words, more memory is needed to load both branches (exponentially more for N nested branches), but most of the time it should be possible to "re-cache" the failed branch with new code quickly enough, before the taken branch finishes executing.

if (x) Bl else Br;

Rather than assuming Bl is taken, assume both Bl and Br are taken (via some kind of parallel processing or special interleaving). After the branch is actually resolved, one of them is invalidated and its cache space can be freed for reuse (perhaps this would require some special technique to fill and use it correctly).
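A rough software analogue of "execute both paths, then discard one" is branchless selection: both branch bodies are computed eagerly and the condition merely selects one result, with no control-flow transfer to mispredict. The function and the stand-in bodies below are illustrative assumptions, not anything from the question:

```python
def both_paths_select(x: bool, left_input: int, right_input: int) -> int:
    # Eagerly "execute" both branch bodies (Bl and Br in the question).
    bl_result = left_input * 2      # hypothetical stand-in for Bl
    br_result = right_input + 10    # hypothetical stand-in for Br
    # Resolve the "branch" arithmetically: keep one result, discard the
    # other, without any conditional jump depending on x.
    k = int(x)
    return k * bl_result + (1 - k) * br_result

print(both_paths_select(True, 3, 3))   # 6  (Bl's result kept)
print(both_paths_select(False, 3, 3))  # 13 (Br's result kept)
```

Compilers apply the same idea in hardware-friendly form (e.g. conditional-move instructions), but only when both bodies are cheap and side-effect free, which hints at why the general hardware version is costly.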

In fact, no prediction circuitry would be needed; all the hardware devoted to prediction could instead be used to handle both branches.

Any thoughts on whether this is feasible?


1 Answer


A Historical Perspective on Fetching Instructions from both Paths

The first similar proposal (to my knowledge) was discussed in this 1968 patent. I understand that you are only asking about fetching instructions from both branches, but bear with me a little. In that patent, three broad strategies were laid out, one of which is following both paths (the fall-through path and the branch path). That is, not just fetching instructions from both paths, but also executing both paths. When the conditional branch instruction is resolved, one of the paths is discarded. It was only mentioned as an idea in the introduction of the patent; the patent itself was about another invention.

Later, in 1977, a commercial processor was released by IBM, called the IBM 3033 processor. That is the first processor (to my knowledge) to implement exactly what you are proposing. I'm surprised to see that the Wikipedia page does not mention that the processor fetched instructions from both paths. The paper that describes the IBM 3033 is titled "The IBM 3033: An inside look". Unfortunately, I'm not able to find the paper, but the paper on the IBM 3090 does mention that fact. So what you're proposing did make sense and was implemented in real processors about four decades ago.

A patent was filed in 1981 and granted in 1984 describing a processor with two memories, from which instructions can be fetched simultaneously. I quote from the abstract of the patent:

A dual fetch microsequencer having two single-ported microprogram memories wherein both the sequential and jump address microinstructions of a binary conditional branch can be simultaneously prefetched, one from each memory. The microprogram is assembled so that the sequential and jump addresses of each branch have opposite odd/even polarities. Accordingly, with all odd addresses in one memory and even in the other, the first instruction of both possible paths can always be prefetched simultaneously. When a conditional branch microinstruction is loaded into the execution register, its jump address or a value corresponding to it is transferred to the address register for the appropriate microprogram memory. The address of the microinstruction in the execution register is incremented and transferred to the address register of the other microprogram memory. Prefetch delays are thereby reduced. Also, when a valid conditional jump address is not provided, that microprogram memory may be transparently overlayed during that microcycle.
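The core trick in that abstract is the parity constraint: with the microprogram laid out so that a branch's fall-through and jump targets always have opposite odd/even addresses, the two single-ported memories never conflict. A toy model (names and structure are my own assumptions, not from the patent text) might look like:

```python
# Two single-ported microprogram memories: one holds even addresses,
# the other holds odd addresses.
even_mem: dict[int, str] = {}
odd_mem: dict[int, str] = {}

def store(addr: int, microinstruction: str) -> None:
    # The assembler places each microinstruction by address parity.
    (even_mem if addr % 2 == 0 else odd_mem)[addr] = microinstruction

def prefetch_both(fall_through_addr: int, jump_addr: int) -> tuple[str, str]:
    # Both fetches can proceed in the same cycle only if the two targets
    # land in different single-ported memories, i.e. opposite parity.
    if fall_through_addr % 2 == jump_addr % 2:
        raise ValueError("port conflict: both targets map to the same memory")
    def read(addr: int) -> str:
        return (even_mem if addr % 2 == 0 else odd_mem)[addr]
    return read(fall_through_addr), read(jump_addr)

store(4, "mi_fall_through")   # even address -> even memory
store(7, "mi_jump_target")    # odd address  -> odd memory
print(prefetch_both(4, 7))    # both paths fetched simultaneously
```

The microprogram assembler, not the hardware, guarantees the opposite-parity invariant, which is why the scheme needs no extra memory ports.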

A Historical Perspective on Fetching and Executing Instructions from both Paths

There is a lot of research published in the 80s and 90s about proposing and evaluating techniques by which instructions from both paths are not only fetched but also executed, even for multiple conditional branches. This will have the potential additional overhead of fetching data required by both paths. The idea of branch prediction confidence was proposed in this paper in 1996 and was used to improve such techniques by being more selective regarding which paths to fetch and execute. Another paper (Threaded Multiple Path Execution) published in 1998 proposes an architecture that exploits simultaneous multithreading (SMT) to run multiple paths following conditional branches. Another paper (Dual Path Instruction Processing) published in 2002 proposes to fetch, decode, and rename, but not execute, instructions from both paths.

Discussion

Fetching instructions from both paths into one or more of the caches reduces the effective capacity of the caches in general, because, typically, one of the paths will be executed much more frequently than the other (in some, potentially highly irregular, pattern). Imagine fetching into the L3 cache, which is practically always shared between all the cores and holds both instructions and data. This can have a negative impact on the ability of the L3 cache to hold useful data. Fetching into the much smaller L2 cache can lead to even substantially worse performance, especially when the L3 is inclusive. Fetching instructions from both paths across multiple conditional branches for all the cores may cause hot data held in the caches to be frequently evicted and brought back. Therefore, extreme variants of the technique you are proposing would reduce the overall performance of modern architectures. However, less aggressive variants can be beneficial.

I'm not aware of any real modern processors that fetch instructions on both paths when they see a conditional branch (perhaps some do, but it's not publicly disclosed). But instruction prefetching has been extensively researched and still is. An important question here that needs to be addressed is: what is the probability that a sufficient number of instructions from the other path are already present in the cache when the predicted path turns out to be the wrong path? If the probability is high, then there would be little motivation to fetch instructions from both paths. Otherwise, there is indeed an opportunity. According to an old paper from Intel (Wrong-Path Instruction Prefetching), on the benchmarks tested, over 50% of instructions accessed on mispredicted paths were later accessed during correct path execution. The answer to this question certainly depends on the target domain of the processor being designed.

Answered 2018-04-03T19:08:19.200