Branches in a Pipeline

First, let's add the text's branch hardware to the pipeline. The text actually does this twice: first with a spectacularly bad implementation that follows the five-step pipeline slavishly, and then with a substantially better implementation that short-circuits the pipe.

The DLX has two conditional branch instructions, and a small number of unconditional jumps. The conditional branches are a branch on zero, and a branch on non-zero. In the book's first branch implementation, the branch decision is made, and the branch target calculated, in the execute stage. A new value is loaded into the PC in the mem stage, and the branch target fetched in the branch instruction's wb stage.

Again pursuing a naive strategy, we stop fetching instructions once we've seen a branch, and don't resume until we've decided whether to take it. This results in a three-cycle stall whenever a branch occurs; this is called the branch penalty.

In benchmarks, about 19% of this instruction set's executed instructions are branches of one sort or another (see p. 105). If branches are the only thing that causes stalls (big if!), then instructions take 0.81*1 + 0.19*4 = 1.57 CPI. That's 57% more time per instruction, or roughly a 36% drop in throughput! We need to do better.
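The arithmetic above can be sketched in a few lines of Python. The function name effective_cpi is made up for illustration, and the model assumes (as the paragraph does) that branches are the only source of stalls and that every other instruction takes one cycle.

```python
def effective_cpi(branch_fraction, branch_penalty, base_cpi=1.0):
    """Average cycles per instruction when only branches cause stalls.

    Every instruction costs base_cpi cycles; each branch costs
    branch_penalty extra stall cycles on top of that.
    """
    return base_cpi + branch_fraction * branch_penalty


if __name__ == "__main__":
    cpi = effective_cpi(branch_fraction=0.19, branch_penalty=3)
    print(f"CPI with a three-cycle penalty: {cpi:.2f}")      # 1.57
    print(f"Time per instruction grows by: {cpi - 1:.0%}")
    print(f"Throughput drops by: {1 - 1 / cpi:.0%}")
```

Calling the same function with a smaller branch_penalty shows how much each cycle shaved off the penalty is worth.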

There are two things we can do about the branch penalty: we can try to reduce the penalty, or we can reduce the frequency with which we have to put up with the penalty. The former is a hardware design issue; the latter can be dealt with either in hardware or software.

Reducing the Branch Penalty

Let's take a close look at the branch part of the pipeline, and see when we actually have the information we need to perform the branch. Remember the form of the branch is
BxxZ reg, offset
where xx is either EQ or NE (so it's either a BEQZ or a BNEZ), and reg is the register that is compared against zero. The offset is added to PC+4 to generate the target address.

Note the following: the offset is in the instruction, so it is available at decode time. PC+4 is also available by now. So (if we throw some hardware at it) we can be generating the target address at the same time we are deciding whether to take the branch. In fact, we can be doing all this at the same time we are deciding whether the instruction even is a branch at all! We can load the new PC value during the exe stage (one cycle earlier than before), and fetch the new instruction on the mem stage. This reduces the branch penalty to two cycles, and the CPI is improved to 1.38. Definitely worth it.
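As a rough sketch of the work being moved into the decode stage, the Python below computes the target address and the taken/not-taken decision side by side. The function names are invented for illustration, and the code assumes a 16-bit signed offset field and 32-bit addresses; it models only the arithmetic, not the pipeline timing.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset16):
    """Target address: PC+4 plus the sign-extended offset (32-bit wraparound)."""
    return (pc + 4 + sign_extend16(offset16)) & 0xFFFFFFFF


def next_pc(pc, opcode, reg_value, offset16):
    """Value to load into the PC after a BEQZ or BNEZ at address pc."""
    taken = (opcode == "BEQZ" and reg_value == 0) or \
            (opcode == "BNEZ" and reg_value != 0)
    # Both inputs (offset and PC+4) are available at decode time, so the
    # target can be computed whether or not the branch turns out to be taken.
    target = branch_target(pc, offset16)
    return target if taken else (pc + 4) & 0xFFFFFFFF


if __name__ == "__main__":
    print(hex(next_pc(0x100, "BEQZ", 0, 0x0008)))   # taken, forward
    print(hex(next_pc(0x100, "BNEZ", 5, 0xFFF8)))   # taken, backward
    print(hex(next_pc(0x100, "BNEZ", 0, 0x0008)))   # not taken
```

Nothing in branch_target depends on the register value, which is why the address calculation and the branch decision can proceed in parallel.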

Actually, I think we can cut this to a one-cycle penalty, but the point is made that it can be reduced. The next issue is reducing the number of times a branch penalty has to be paid. If the hardware can make a guess as to whether a branch is likely to be taken, it can proceed on the basis of that prediction until it knows whether the branch will actually be taken. There is a series of increasingly elaborate schemes for making predictions.

Hardware Static Branch Prediction

We can start by making the same prediction for all branches: either always assume the branch will be taken, or assume it will not. In the case of the sample pipeline, it is very simple to assume the branch is not taken; just keep fetching and executing instructions. You will know whether the branch was to be taken by the time anything is written, so instructions are easy to cancel. We now have no penalty for a branch-not-taken, but still have a three-cycle branch-taken penalty.

The only problem with this is that most branches are taken. For this architecture on the SPEC benchmarks, we see 60% of forward branches are taken, and 85% of backward branches; the average across all branches is about 67%. In some designs, we could take advantage of this by predicting branches are taken; this would reduce or eliminate the branch penalty in 67% of cases instead of 33%. The text makes the claim that this actually won't do us any good in the DLX pipeline, since we make the branch decision on the same cycle that we generate the target address. This isn't quite true: if the branch immediately follows the instruction that sets the register it tests, a one-cycle stall is required before we can determine whether the branch is to be taken. (The text's observation is correct if the machine uses condition codes, though.)
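To put rough numbers on the two static predictions, here is a small Python sketch. It is a simplification, not a model of the DLX: it assumes a correctly predicted branch costs nothing, a mispredicted branch costs the full three-cycle penalty, and branches are the only source of stalls. The function name is made up for illustration; the 19% and 67% figures are the ones quoted above.

```python
def cpi_with_static_prediction(branch_fraction, taken_fraction,
                               mispredict_penalty, predict_taken):
    """Average CPI when every branch gets the same static prediction.

    Assumes a correct prediction is free and a wrong one costs
    mispredict_penalty stall cycles.
    """
    if predict_taken:
        mispredict_rate = 1.0 - taken_fraction   # wrong when not taken
    else:
        mispredict_rate = taken_fraction         # wrong when taken
    return 1.0 + branch_fraction * mispredict_rate * mispredict_penalty


if __name__ == "__main__":
    for predict_taken in (False, True):
        cpi = cpi_with_static_prediction(0.19, 0.67, 3, predict_taken)
        label = "taken" if predict_taken else "not taken"
        print(f"predict {label}: CPI = {cpi:.2f}")
```

Under these assumptions, predicting not-taken gives a CPI of about 1.38 and predicting taken about 1.19, which is the gap a design that can exploit predict-taken stands to gain.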

Delayed Branches

On machines that use delayed branches, making use of them in the compiler is quite challenging. The easy answer is to just always put a NOP in the delay slot; this makes it act just like having a stall instead. What an optimizing compiler will do instead is to try to find an instruction that can be placed in the delay slot. For instance, it is frequently possible to duplicate the first instruction of the loop, put the copy in the delay slot, and make the branch target be the following instruction. This works fine if we can throw away what we just did when the branch falls through; if we can't, we have to have a way to unravel it. The SPARC has an extra wrinkle in its branch instructions: you can set a bit that will throw away the instruction in the delay slot if the branch is not taken. It exists for exactly this situation.
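The loop transformation described above can be checked with a toy trace generator. This Python sketch is purely illustrative: run_loop and its parameters are invented names, instructions are just labels, and the loop is assumed to end in a delayed branch that is taken on every iteration except the last.

```python
def run_loop(body, delay_slot, target_index, iterations, annul_if_not_taken):
    """Return the sequence of instruction labels that actually execute.

    body               -- labels of the loop instructions before the branch
    delay_slot         -- label of the instruction in the branch delay slot
    target_index       -- index in body that the branch jumps back to
    annul_if_not_taken -- if True, the delay slot is discarded when the
                          branch falls through (the annul-bit behaviour)
    """
    trace = []
    index = 0
    for i in range(iterations):
        trace.extend(body[index:])            # run up to the branch
        taken = i < iterations - 1            # last iteration falls through
        if taken or not annul_if_not_taken:
            trace.append(delay_slot)          # delay slot executes
        index = target_index                  # resume at the branch target
    return trace


if __name__ == "__main__":
    body = ["A", "B", "C"]
    # Easy answer: NOP in the delay slot, branch back to the top of the loop.
    naive = run_loop(body, "NOP", 0, 3, annul_if_not_taken=False)
    # Optimized: copy of A in the delay slot, branch to the instruction after A.
    filled = run_loop(body, "A", 1, 3, annul_if_not_taken=True)
    print(naive)
    print(filled)
    print([op for op in naive if op != "NOP"] == filled)
```

With the annul flag set, the filled version executes the same useful instructions as the NOP version with no wasted slots; with the flag cleared, an extra copy of A runs after the final iteration, which is the work that would otherwise have to be thrown away or unravelled.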