Pipeline Optimization: MIPS branch hardware

The text uses the branch hardware to provide an example of how we can go from a poorly-optimized, slow pipeline to a more-optimized, better pipeline. The branch hardware in figure 6.12 is a very naive implementation, which performs the following operations in each stage:

Instruction Fetch
Fetches Instruction (duh)
Instruction Decode
Reads Registers. Following instruction is fetched at the same time.
Execute
Target address is generated, ALU determines whether registers are equal. Following instruction is now in ID, and a third instruction is in IF.
Memory
PC is loaded from branch target input of mux. Next instruction is in Execute. Instructions currently in EX, ID and IF are cancelled.

This example also serves to introduce us to the idea of a ``branch penalty.'' If the branch is taken, the following three instructions have to be cancelled; this gives us a three-cycle delay in the pipeline whenever a branch is taken.

An implementation's branch penalty may simply be a single penalty that applies whether the branch is taken or not, or it may be divided into separate ``branch taken'' and ``branch not taken'' penalties. In this case, the branch taken penalty is three cycles, while the branch not taken penalty is zero.
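To see what a three-cycle taken penalty costs on average, here is a minimal sketch of the arithmetic. The function name and the workload numbers (20% branches, 60% of them taken) are assumptions made up for illustration; they are not figures from the text.

```python
def effective_cpi(branch_freq, taken_frac, taken_penalty,
                  not_taken_penalty=0, base_cpi=1.0):
    """Average cycles per instruction when branches are the only source of stalls."""
    taken_cost = branch_freq * taken_frac * taken_penalty
    not_taken_cost = branch_freq * (1.0 - taken_frac) * not_taken_penalty
    return base_cpi + taken_cost + not_taken_cost

# Assumed workload: 20% of instructions are branches, 60% of those are taken.
naive_cpi = effective_cpi(0.20, 0.60, taken_penalty=3)
print(round(naive_cpi, 2))
```

Under these assumed numbers, roughly a quarter of the machine's cycles go to cancelled instructions, which is why shrinking the penalty is worth some effort.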

Of course, we want to reduce branch penalties as much as possible. We can do this by analyzing the hardware more carefully and seeing whether we can move work earlier in the pipeline. It turns out we can; let's see how, by asking how early in the pipeline each step can be done. I'll be pulling assumptions about how fast we can do things out of a hat as we do this, but I will claim they're reasonable assumptions. (Actually, I suspect the reason the book used such a lousy branch implementation in the first place is so they could show us how to speed it up.)

We can actually calculate the target address in the ID stage: none of the data for the target address generation has to come out of the registers; it's just the old PC+4 (available at the start of the cycle from the IF/ID pipeline register) and a value from the instruction itself (sign-extended and shifted, but those are both very fast operations). We can move the branch target address ALU to the ID stage.
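The target address calculation described above can be sketched as follows. This is an illustration, not the book's datapath; the function names are hypothetical, and the example addresses are made up.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits of imm16 as a two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def branch_target(pc_plus_4, imm16):
    """Target = (PC + 4) + (sign-extended immediate shifted left by 2), wrapped to 32 bits."""
    return (pc_plus_4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF

# A branch at 0x00400010 has PC+4 = 0x00400014.
backward = branch_target(0x00400014, 0xFFFD)  # immediate is -3: three words back
forward = branch_target(0x00400014, 0x0003)   # immediate is +3: three words ahead
print(hex(backward), hex(forward))
```

Note that nothing here reads a register: both inputs are available as soon as the instruction leaves the IF/ID pipeline register.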

By itself, that wouldn't buy us anything: we're still waiting for the execute stage to compare the register values. But a point that we made earlier, while we were talking about the MIPS instruction set, is that comparing two values for equality is much faster than comparing them for greater or less - so, we can put a little bit of hardware in the ID stage, downstream from the registers themselves. If we make the assumption that reading the registers, and checking them for equality, can be done as quickly as the slowest of the other pipeline stages, then we can do this without slowing the pipeline down at all. If we can't make that assumption, we have to ask how frequent the branches are to see if slowing down the pipe is justified by the improved speed of the branches.
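One way to see why the equality test is cheap: it needs no carry chain. The sketch below assumes 32-bit registers and uses an illustrative function name. In hardware the per-bit XORs would all happen in parallel and feed a wide NOR, whereas a greater-or-less comparison is normally built on a subtraction whose borrow has to propagate across the word.

```python
def regs_equal(a, b, width=32):
    """Equality as hardware might do it: XOR each bit pair, then check that no bit differs."""
    mask = (1 << width) - 1
    diff = (a ^ b) & mask   # a 1 wherever the two registers disagree
    return diff == 0        # the wide NOR: true only if every XOR output is 0

print(regs_equal(0x1234, 0x1234), regs_equal(0x1234, 0x1235))
```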

As it happens, when we discuss caches we'll see that a cache lookup requires a read followed by a check for equality (and the result also has to go through a mux after that, a fact that will become relevant in a second). So the work we've now moved into the ID stage is no greater than the work already required by the IF and MEM stages - and since those stages read from a larger memory than the register file, the ID stage should even have some time left over.

OK, so at this point we're generating the new target address and checking for equality in the ID stage, but then passing this information through the pipeline register to the Execute stage before loading a new value into the PC. Given that the IF and ID stages already have to account for the time to pass through a mux, we can safely assume we can load the branch target address into the PC at the end of the ID stage.

This means that we can reduce the branch taken penalty to one cycle.
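Earlier we asked what happens if the extra work in ID forces a longer clock cycle. Here is a rough sketch of that trade-off: it compares time per instruction for the original pipeline (taken penalty of three, cycle time 1.0) against the sped-up one (taken penalty of one, cycle time stretched by some factor). The function names, the stretch factors, and the workload (20% branches, 60% taken) are all assumptions for illustration.

```python
def time_per_instruction(cycle_time, branch_freq, taken_frac, taken_penalty):
    """Average time per instruction = CPI * cycle time, with branches as the only stalls."""
    cpi = 1.0 + branch_freq * taken_frac * taken_penalty
    return cpi * cycle_time

def early_branch_wins(stretch, branch_freq, taken_frac):
    """True if resolving branches in ID still pays off when it stretches the cycle by `stretch`."""
    original = time_per_instruction(1.0, branch_freq, taken_frac, taken_penalty=3)
    improved = time_per_instruction(stretch, branch_freq, taken_frac, taken_penalty=1)
    return improved < original

# Assumed workload: 20% branches, 60% taken.
print(early_branch_wins(1.10, 0.20, 0.60))  # a 10% slower clock
print(early_branch_wins(1.25, 0.20, 0.60))  # a 25% slower clock
```

With these made-up numbers the break-even point is a cycle stretched by a bit over 20%; the more frequent taken branches are, the more slowdown the faster branch can pay for.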

Delayed Branches

Now it's time to mention a little detail that turns up on page 444 of the text: rather than taking a penalty (like we just described), we can go ahead and execute the instruction that we fetched while we were processing the branch. This technique is called a ``delayed branch,'' and is actually used in the MIPS. Notice what happens if a later implementation uses a pipeline in which the branch decision has to be made later (like the original branch pipeline we described before): we end up with both a delayed branch emulating this pipeline, and also a branch penalty. Ow....
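To make the delayed-branch behaviour concrete, here is a toy interpreter with a single delay slot. It is a sketch under heavy assumptions: the two-instruction "ISA" (addi and beq on named registers, with branch targets given as program indices) is invented for this example, registers that were never written read as 0, and a branch placed in a delay slot is not handled in any meaningful way.

```python
def run_delayed(program, max_steps=100):
    """Run a toy program in which the instruction after a branch always executes."""
    regs = {}
    pc, redirect = 0, None
    for _ in range(max_steps):
        if pc >= len(program):
            break
        op = program[pc]
        # If the previous instruction was a taken branch, this one is its delay
        # slot: it still executes, and only afterwards does the PC change.
        next_pc = redirect if redirect is not None else pc + 1
        redirect = None
        if op[0] == "addi":
            _, reg, value = op
            regs[reg] = regs.get(reg, 0) + value
        elif op[0] == "beq":
            _, r1, r2, target = op
            if regs.get(r1, 0) == regs.get(r2, 0):
                redirect = target
        pc = next_pc
    return regs

program = [
    ("addi", "t0", 1),
    ("beq", "zero", "zero", 4),  # always taken: both registers read as 0
    ("addi", "t1", 10),          # delay slot: executes even though the branch is taken
    ("addi", "t2", 100),         # skipped
    ("addi", "t3", 1000),        # branch target
]
result = run_delayed(program)
print(result)
```

The instruction in the delay slot does useful work in the cycle that would otherwise have been lost to the penalty, which is the whole point of the technique.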

Branch Prediction

Notice that our sped-up branch depends on the data for the branch being in registers at the beginning of the ID stage of the branch instruction. If it isn't there yet, we've got a branch (or control) hazard we need to deal with. Naturally, if the data is anywhere in the pipeline, we can handle the situation by forwarding.

But what if one instruction generates a value, and the following instruction uses that value in branching? And what if the instruction preceding the branch instruction is a lw providing the information needed for the branch? In both these cases, the data isn't available yet when we want to take the branch.

The obvious answer is to stall, since we've got a data hazard. In the first case, we need to stall for one cycle, and in the second for two cycles. But, if we think about it a bit, we can make a guess as to whether the branch should be taken or not, act according to the guess, and then cancel instructions in the pipeline if the guess is wrong. Because we have this technique available to us for branches, while we don't for "normal" data hazards, we'll refer to these hazards as branch hazards and use "speculation" combined with "branch prediction" to reduce the penalty.

The simplest branch prediction scheme is a static scheme: always make the same guess. Suppose we always assume the branch is not taken. In this case, when we hit a branch, we go ahead and keep fetching instructions from the sequential instruction stream. If it turns out we did want to branch, we cancel any instructions currently in the pipeline and continue from the branch target (incidentally, while this is simple to describe, drawing it is a mess).
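Since drawing it is a mess, here is a small sketch that lists what gets fetched around a single branch under predict-not-taken. The function name, the status labels, and the example addresses are hypothetical; `penalty` is the number of sequential instructions fetched before the branch is resolved, and delay slots are ignored.

```python
def fetch_trace(branch_pc, target_pc, taken, penalty=1):
    """List (address, status) pairs fetched around one branch; instructions are 4 bytes."""
    trace = [(branch_pc, "branch")]
    # Keep fetching sequentially while the branch is being resolved.
    for i in range(1, penalty + 1):
        addr = branch_pc + 4 * i
        trace.append((addr, "cancelled" if taken else "kept"))
    # After resolution, continue from the target or from the fall-through path.
    resume = target_pc if taken else branch_pc + 4 * (penalty + 1)
    trace.append((resume, "kept"))
    return trace

for addr, status in fetch_trace(0x100, 0x200, taken=True, penalty=3):
    print(hex(addr), status)
```

When the guess is right, every fetched instruction is kept and the branch costs nothing; when it is wrong, exactly `penalty` fetches are thrown away.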

This simple scheme guesses branches are never taken. We can have more sophisticated prediction schemes if we want, though - for instance, a conditional branch to a target address that occurs earlier in the code than the branch instruction is likely to be part of a loop, so it's likely to be taken. So we might guess that a forward branch is not taken, and a backward branch is - finding the right guesses to make would have to depend on simulations. One old architecture (the Pyramid) had an extra bit in the branch instructions for the compiler to use in telling it which way to guess!
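The forward-not-taken, backward-taken guess is easy to sketch, along with the kind of measurement a simulation would make. The function names and the trace below (a loop-closing branch that runs ten times) are made up for illustration.

```python
def predict_taken(branch_pc, target_pc):
    """Static guess: backward branches (target before the branch) are taken."""
    return target_pc < branch_pc

def accuracy(trace):
    """trace: list of (branch_pc, target_pc, actually_taken) tuples."""
    hits = sum(predict_taken(pc, tgt) == taken for pc, tgt, taken in trace)
    return hits / len(trace)

# Hypothetical loop: the branch at 0x120 jumps back to 0x100 nine times, then falls through.
loop_trace = [(0x120, 0x100, True)] * 9 + [(0x120, 0x100, False)]
print(accuracy(loop_trace))
```

On this trace an always-not-taken guess would be right only once in ten tries, which is the intuition behind treating backward branches differently.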


Last modified: Mon Feb 28 10:28:06 MST 2005