The DLX has two conditional branch instructions, and a small number of unconditional jumps. The conditional branches are a branch on zero, and a branch on non-zero. In the book's first branch implementation, the branch decision is made, and the branch target calculated, in the execute stage. A new value is loaded into the PC in the mem stage, and the branch target fetched in the branch instruction's wb stage.
Again, pursuing a naive strategy, we stop fetching instructions once we've seen a branch, and don't resume until we've decided whether to take it. This results in a three-cycle stall whenever a branch occurs; this is called the branch penalty.
In benchmarks, about 19% of this instruction set's executed instructions are branches of one sort or another (see p. 105). If branches are the only thing that causes stalls (big if!), then instructions take 0.81*1 + 0.19*4 = 1.57 CPI. That's about a 36% drop in performance (execution time grows by 57%)! We need to do better.
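The arithmetic above can be sketched as a small function. This is only an illustration under the same simplifying assumption (branches are the only source of stalls, and every branch pays the full penalty); the function and parameter names are made up for this example.

```python
def cpi_with_branch_stalls(branch_fraction, branch_penalty, base_cpi=1.0):
    """Average cycles per instruction when every branch stalls the pipeline.

    Assumes (as in the notes) that branches are the only cause of stalls
    and that every branch pays the full penalty.
    """
    return base_cpi + branch_fraction * branch_penalty


# 19% branches, three-cycle branch penalty
print(round(cpi_with_branch_stalls(0.19, 3), 2))  # 1.57
```

Passing a smaller penalty shows directly how much each cycle shaved off the branch penalty is worth.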
There are two things we can do about the branch penalty: we can try to reduce the penalty, or we can reduce the frequency with which we have to put up with the penalty. The former is a hardware design issue; the latter can be dealt with either in hardware or software.
BxxZ offset
where xx is either EQ or NE (so it's either a BEQZ or a BNEZ). The offset is added to PC+4 to generate the target address.
Note the following: the offset is in the instruction, so it is available at decode time. PC+4 is also available by then. So (if we throw some hardware at it) we can generate the target address at the same time we are deciding whether to take the branch. In fact, we can do all this at the same time we are deciding whether the instruction is a branch at all! We can load the new PC value during the exe stage (one cycle earlier than before), and fetch the new instruction in the mem stage. This reduces the branch penalty to two cycles, and the CPI improves to 0.81*1 + 0.19*3 = 1.38. Definitely worth it.
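As a rough software model of what that extra decode-stage hardware computes, here is a sketch. It assumes a 16-bit signed offset field and 32-bit addresses, neither of which is stated in these notes; the function names and the way the register value is passed in are hypothetical.

```python
def sign_extend(value, bits=16):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def next_pc(pc, opcode, reg_value, offset_field):
    """Return the address of the next instruction to fetch.

    The target (PC+4 plus the offset) and the taken/not-taken decision
    depend on different inputs, so hardware can compute them in parallel.
    Assumption: 16-bit offset field, 32-bit address space.
    """
    fallthrough = pc + 4
    target = (fallthrough + sign_extend(offset_field)) & 0xFFFFFFFF
    if opcode == "BEQZ":
        taken = reg_value == 0
    elif opcode == "BNEZ":
        taken = reg_value != 0
    else:
        taken = False  # not a conditional branch: fall through
    return target if taken else fallthrough


# Backward branch by 8 bytes, taken because the tested register is zero
print(hex(next_pc(0x100, "BEQZ", 0, 0xFFF8)))  # 0xfc
```

Computing the target unconditionally, even for instructions that turn out not to be branches, is what lets the address be ready as soon as the decision is.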
Actually, I think we can cut this to a one-cycle penalty, but the point is made that it can be reduced. The next issue is reducing the number of times a branch penalty has to be paid. If the hardware can make a guess as to whether a branch is likely to be taken, it can proceed on the basis of that prediction until it knows whether the branch will actually be taken. There is a series of increasingly elaborate schemes for making predictions.
The only problem with the simplest scheme, predicting that a branch is not taken, is that most branches are taken. For this architecture on the SPEC benchmarks, 60% of forward branches and 85% of backward branches are taken; the average across all branches is about 67%. In some designs, we could take advantage of this by predicting that branches are taken; this would reduce or eliminate the branch penalty in 67% of cases instead of 33%. The text claims that this won't do us any good in the DLX pipeline, since we make the branch decision on the same cycle that we generate the target address. This isn't quite true: a one-cycle stall is required in determining whether the branch is to be taken if the branch immediately follows the instruction in which the register is set. (The text makes the correct observation if the machine uses condition codes, though.)
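To put rough numbers on the two static predictions, here is a sketch for a hypothetical design in which a correct prediction costs nothing and a wrong one costs the full two-cycle penalty. That cost model is an assumption for illustration (as noted above, it does not describe the DLX pipeline exactly), and the function name is made up.

```python
def cpi_with_static_prediction(branch_fraction, taken_fraction,
                               mispredict_penalty, predict_taken):
    """Average CPI with a fixed always-taken or always-not-taken guess.

    Assumption: a correct guess costs zero extra cycles and a wrong
    guess costs mispredict_penalty cycles.
    """
    if predict_taken:
        mispredict_rate = 1.0 - taken_fraction  # wrong when not taken
    else:
        mispredict_rate = taken_fraction        # wrong when taken
    return 1.0 + branch_fraction * mispredict_rate * mispredict_penalty


# 19% branches, 67% of them taken, two-cycle misprediction penalty
print(round(cpi_with_static_prediction(0.19, 0.67, 2, predict_taken=False), 2))  # 1.25
print(round(cpi_with_static_prediction(0.19, 0.67, 2, predict_taken=True), 2))   # 1.13
```

Under these assumptions, guessing "taken" is wrong half as often as guessing "not taken", which is the whole attraction.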