Simple Pipeline

The first way to go about speeding up a uniprocessor is to use pipelining. We do this by identifying the stages in the execution of an instruction (note that this is easier to do if we have simple instructions, which is one of the motivations for RISC). In DLX, we can divide instruction execution into the following steps:

  1. Instruction Fetch
  2. PC Increment
  3. Instruction Decode
  4. Register Fetch
  5. Second Register Fetch (if needed)
  6. Arithmetic
  7. Memory read/write (if needed)
  8. Register write-back

Now, how can we get these steps to operate as quickly as possible? An early observation is that the instruction set is oriented toward instructions that require two source registers and produce one result. That makes a dual-ported register file a natural fit: both source registers can be read at once, so the two Register Fetch steps can be done in a single cycle.

A second observation is that if we want to fetch two registers, the register numbers will always be in the same place in the instruction. In fact, if we want, we can go ahead and read the registers before we even know what instruction is to be performed, and throw away the data if we turn out not to need it! This is a point about hardware that tends to be very difficult for computer scientists to grasp: you want to think long and hard about adding extra hardware to the processor. In general, you only want to add it if you're going to use it a LOT. If some feature of your instruction set requires you to add hardware to implement the feature, consider carefully whether the feature is really needed. But once you've added the hardware, using it is free. So it doesn't cost you anything to go ahead and read the registers, whether you need the data or not. This means the Instruction Decode and Register Fetch steps can be combined.
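To make the "read first, decide later" idea concrete, here is a minimal sketch in Python. It assumes a 32-bit instruction word with a 6-bit opcode followed by two 5-bit source register fields (bits 25-21 and 20-16, a MIPS-like layout); the exact bit positions and the function names are assumptions made for illustration, not something fixed by these notes.

```python
# Assumed layout (hypothetical, MIPS-like): | opcode:6 | rs1:5 | rs2:5 | ... |
RS1_SHIFT = 21
RS2_SHIFT = 16
REG_MASK = 0x1F  # 5 bits select one of 32 registers


def source_register_fields(instruction_word):
    """Pull out both source register numbers without looking at the opcode."""
    rs1 = (instruction_word >> RS1_SHIFT) & REG_MASK
    rs2 = (instruction_word >> RS2_SHIFT) & REG_MASK
    return rs1, rs2


def speculative_register_read(register_file, instruction_word):
    """Read both register values up front; the caller may discard either one."""
    rs1, rs2 = source_register_fields(instruction_word)
    # Models a dual-ported read: both values are available in the same step.
    return register_file[rs1], register_file[rs2]
```

Because the fields sit in the same place in every instruction, nothing in this code depends on the opcode, which is why the read can overlap with decoding.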

Our third observation is that we can actually increment the PC while doing other work. We can send the PC contents to the ALU at the same time we are sending them to the instruction memory, and load the new value into the PC at the same time the instruction is coming out of the instruction memory.
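A small sketch of this overlap, assuming 4-byte instructions and an instruction memory modeled as a Python list indexed by word (both are assumptions for illustration, as is the function name):

```python
INSTRUCTION_BYTES = 4  # assumed fixed instruction size


def fetch_stage(pc, instruction_memory):
    """Fetch the instruction at pc and compute the next PC from the same input.

    Both results depend only on the old pc, so in hardware the two
    operations can proceed at the same time.
    """
    instruction = instruction_memory[pc // INSTRUCTION_BYTES]
    next_pc = pc + INSTRUCTION_BYTES
    return instruction, next_pc
```

Neither result is an input to the other, which is the property that lets the increment be hidden inside the fetch.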

These observations, taken together, lead to pipelining as a low-cost means of improving performance. By identifying the steps in the instruction execution that use different parts of the processor, we can have multiple instructions in execution at a time, in different parts of the processor. The standard pipeline for a Harvard architecture computer uses five stages:

  1. Instruction Fetch
  2. Instruction decode/register read
  3. ALU op
  4. Memory op
  5. Register write-back

The standard pipeline for a Princeton architecture computer uses only four stages; there is no memory op stage (which causes problems we'll get to in just a moment).
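
To see how the five-stage pipeline keeps several instructions in flight, the sketch below builds an idealized timing chart. It assumes one cycle per stage and no stalls of any kind; the stage abbreviations (IF, ID, EX, MEM, WB) and the function names are labels chosen here for illustration.

```python
# Abbreviations for the five stages listed above (labels chosen for this sketch).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage occupied in cycle c."""
    total_cycles = num_instructions + len(stages) - 1
    schedule = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            # Instruction i enters the pipeline in cycle i and advances
            # one stage per cycle.
            row[i + s] = name
        schedule.append(row)
    return schedule


def format_schedule(schedule):
    """Render the schedule as text, one instruction per line."""
    lines = []
    for i, row in enumerate(schedule):
        cells = " ".join(f"{cell:<3}" for cell in row)
        lines.append(f"i{i}: {cells}".rstrip())
    return "\n".join(lines)
```

Under these assumptions, n instructions finish in n + 4 cycles rather than 5n; for example, `pipeline_schedule(3)` spans 7 cycles where running the instructions one at a time would take 15.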