CS473 - HW5

Out of Order Execution

Due Monday, October 14, 2002

  1. (20 points) Draw a Gantt (timing) chart, using the same conventions as those on my CDC notes page, showing how long it would take to execute the following code on a CDC:
    
    X1 <- X2 / X3
    X4 <- X1 * X0
    X0 <- X3 + X5
    X6 <- X2 * X3
    
    

    [Figure: CDC Timing Diagram]

    1. The divide has no dependencies, so it's able to go immediately.
    2. The multiply depends on the result of the divide (a second-order conflict). It is issued, but can't go until the cycle on which the divide concludes.
    3. The add has no dependencies, and is able to go immediately. But it's going to write to X0, which the multiply is going to use as an operand, so its write-back has to be delayed until the cycle after that instruction can go.
    4. The second multiply has no dependencies, so it can go immediately. There are two multiply units, so there is no first-order conflict.
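As a cross-check on the list above, here is a minimal Python sketch that finds the register conflicts among the four instructions. The tuple encoding and the function name `find_conflicts` are my own assumptions for illustration. It reports only which pairs conflict and on which register, not the timing, since the unit latencies live on the CDC notes page rather than here.

```python
# Each instruction is (destination, source1, source2), copied from the listing above.
program = [
    ("X1", "X2", "X3"),  # 1: X1 <- X2 / X3
    ("X4", "X1", "X0"),  # 2: X4 <- X1 * X0
    ("X0", "X3", "X5"),  # 3: X0 <- X3 + X5
    ("X6", "X2", "X3"),  # 4: X6 <- X2 * X3
]


def find_conflicts(program):
    """Return (kind, earlier, later, register) tuples; instructions are numbered from 1."""
    conflicts = []
    for j, (dst_j, *srcs_j) in enumerate(program):
        for i in range(j):
            dst_i, *srcs_i = program[i]
            if dst_i in srcs_j:   # later instruction reads what an earlier one writes
                conflicts.append(("RAW", i + 1, j + 1, dst_i))
            if dst_j in srcs_i:   # later instruction overwrites an earlier one's operand
                conflicts.append(("WAR", i + 1, j + 1, dst_j))
            if dst_i == dst_j:    # both write the same register
                conflicts.append(("WAW", i + 1, j + 1, dst_i))
    return conflicts


conflicts = find_conflicts(program)
for conflict in conflicts:
    print(conflict)
```

The RAW entry is the divide-to-multiply dependency from item 2, and the WAR entry is the reason the add's write-back to X0 is held up in item 3.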

  2. (50 points)

    Here's some code that purports to be MIPS code for the C program in HW4 (since it only ``purports to be,'' you don't need to worry about whether I've got a bug in it!), except that it should only go through the loop body twice:

    	
          lui     $2, hi(awry)
          ori     $2, $2, lo(awry) ; pointer to awry
          addi    $3, $2, 8        ; end of awry
          lui     $4, hi(histo)
          ori     $4, $4, lo(histo); pointer to histo
        
    loop: lw      $5, 0($2)        ; read awry[i]
          sll     $5, $5, 2        ; convert to index
          add     $5, $4, $5       ; get pointer to location in histo
          lw      $6, 0($5)        ; get old value of histo
          addiu   $6, $6, 1        ; increment histo
          sw      $6, 0($5)        ; store back
          addiu   $2, $2, 4        ; increment awry pointer
          bne     $2, $3, loop     ; if not done, do it again
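The HW4 C program isn't reproduced here, so the following is only my reading of what the assembly does, written as a short Python sketch. The names `awry` and `histo` come from the listing; the function name and the sample data are hypothetical.

```python
def histogram_two_entries(awry, histo):
    """Mirror the loop: for each of the first two words of awry, bump histo[value]."""
    for i in range(2):        # $3 = $2 + 8, i.e. two 4-byte words
        value = awry[i]       # lw $5, 0($2)
        histo[value] += 1     # sll/add/lw/addiu/sw: read-modify-write of histo[value]
    return histo


# Hypothetical sample data, not taken from the homework.
result = histogram_two_entries([3, 3], [0, 0, 0, 0])
print(result)
```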
    	
    	

    Now, assume you have a CPU executing MIPS code with the following assumptions:

    In a moment, I'll be asking you some questions about this machine. But first, here's an example showing the execution of the first few instructions.

    [Figure: Execution of first six instructions]

    1. On the first cycle, instructions 1-4 are fetched into IF/ID.

    2. Instructions 1 and 4 have no dependencies on other instructions, so they start down the pipe. This frees space for two more instructions, so 5 and 6 are fetched.

    3. Instruction 2 (which was held for a cycle) will have had its dependency satisfied by the time it reaches the execute stage, so it starts down the pipe on cycle 3. Instruction 5 can also start down the pipe (it didn't actually have to wait, since it was just fetched last cycle).

      Since a second pair of instructions has started, there is room to fetch another pair. I'm not going to show that; it's where the homework starts for you.

    4. Instructions 3 and 6 have now had their dependencies satisfied, and can proceed. Again, there's room for two more instructions in IF/ID.

    As the instructions move on to retirement, notice that 1-5 all take four cycles (though they may be delayed by stalls), and 6 takes five.

    One little thing: it would be easy to get a misimpression from this that you can always fetch two new instructions per cycle. You can't; the number you can fetch is determined by the number that go down the pipes. So if only one went down a pipe you could only fetch one; if four went down together you could fetch four.
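The fetch rule can be sketched in a few lines of Python. This is a toy model under my own assumptions: IF/ID holds four instructions (the limit used in the answer to the first question), instructions leave IF/ID before the fetch on the same cycle, and the instruction stream never runs dry. The function name `fetch_counts` is hypothetical.

```python
def fetch_counts(issued_per_cycle, capacity=4):
    """Given how many instructions leave IF/ID on each cycle, return how many are fetched."""
    waiting = 0       # instructions currently held in IF/ID
    fetched = []
    for issued in issued_per_cycle:
        if issued > waiting:
            raise ValueError("cannot issue more instructions than IF/ID holds")
        waiting -= issued            # these start down the pipes
        free = capacity - waiting    # slots opened up (all of them on the first cycle)
        fetched.append(free)
        waiting = capacity           # IF/ID is full again after the fetch
    return fetched


# The worked example: nothing issues on cycle 1, then two per cycle.
print(fetch_counts([0, 2, 2, 2]))
# One issues, then four issue together.
print(fetch_counts([0, 1, 4]))
```

After the first cycle, the number fetched always equals the number that went down the pipes, which is the point of the paragraph above.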

    Now then, on to the questions:

    1. In the assumptions at the start of this problem, I mentioned infinite pipelines, register porting, and so forth. Given the restriction on the size of the IF/ID register, how ``infinite'' do the other resources actually have to be? In particular, what is the maximum number of arithmetic and memory pipelines you can actually need? How many register reads and writes actually have to be possible on a cycle? How many simultaneous memory reads and writes are actually needed?

      The key is to ask how many instructions can be in the relevant stage on any one cycle. Since the IF/ID size sets a limit of four, this means it can't need more than four arithmetic pipes, four memory pipes, eight register reads, four register writes, and four memory accesses. (This turned out to be a bit ambiguous; I was counting data memory separately from instruction memory, so you could have gotten eight memory reads and four writes by counting instruction reads too.)
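The arithmetic behind those bounds, as a small Python sketch. The two-reads-and-one-write-per-instruction figures are my assumption about the MIPS instructions involved, and the function and key names are made up for illustration.

```python
def resource_bounds(ifid_width, reads_per_instruction=2, writes_per_instruction=1):
    """Upper bounds on per-cycle resources when at most ifid_width instructions can start."""
    return {
        "arithmetic_pipes": ifid_width,      # worst case: every instruction is arithmetic
        "memory_pipes": ifid_width,          # worst case: every instruction is a load/store
        "register_reads": ifid_width * reads_per_instruction,
        "register_writes": ifid_width * writes_per_instruction,
        "data_memory_accesses": ifid_width,  # data memory only, instruction fetch not counted
    }


bounds = resource_bounds(4)
print(bounds)
```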

    2. Draw a Gantt (timing) diagram showing how the code I gave you is executed on this machine.

      [Figure: Out-of-order MIPS Execution]

    3. How many arithmetic and memory pipelines actually turn out to be needed for this code on this machine?

      There are never more than two instructions in a single stage of the arithmetic pipe, nor more than one in a single stage of the memory pipe. So that's all that was really needed.

    4. Evaluate the extent to which the IF/ID register is a bottleneck in the execution of the code, and the extent to which all available parallelism has been exploited.

      Just eyeballing it, it looks like the main constraint is from instruction dependencies. But the right way to answer the question is to just issue all the instructions on the first cycle, and see how long it ends up taking:

      [Figure: Unlimited-Fetch MIPS]

      (I've drawn an antidependency with a dashed arrow on this one.) Sure enough, without assuming register renaming, all that extra fetching doesn't do much for us. If we'd done some loop unrolling, though, it would have made a much bigger difference, since we would have gotten rid of the antidependency in the process.
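To make the register-renaming remark concrete, here is a minimal Python sketch. It is not the machine assumed in this problem (which has no renaming), only an illustration of what renaming would remove. The `(destination, sources)` encoding, the physical register names `p0`, `p1`, ..., and both function names are hypothetical. The count is by register name only, so it also includes pairs that a true dependency already keeps in order.

```python
# (destination, sources) for one pass through the loop body; None means no register is written.
LOOP_BODY = [
    ("$5", ["$2"]),        # lw    $5, 0($2)
    ("$5", ["$5"]),        # sll   $5, $5, 2
    ("$5", ["$4", "$5"]),  # add   $5, $4, $5
    ("$6", ["$5"]),        # lw    $6, 0($5)
    ("$6", ["$6"]),        # addiu $6, $6, 1
    (None, ["$6", "$5"]),  # sw    $6, 0($5)
    ("$2", ["$2"]),        # addiu $2, $2, 4
    (None, ["$2", "$3"]),  # bne   $2, $3, loop
]
two_iterations = LOOP_BODY * 2  # the loop body runs twice


def count_war(program):
    """Count (earlier, later) pairs where the later one writes a name the earlier one reads."""
    count = 0
    for j, (dst_j, _) in enumerate(program):
        if dst_j is None:
            continue
        for _, srcs_i in program[:j]:
            if dst_j in srcs_i:
                count += 1
    return count


def rename(program):
    """Give every written value a fresh name; sources use the latest name of their register."""
    latest = {}   # architectural register -> current fresh name
    renamed = []
    next_id = 0
    for dst, srcs in program:
        new_srcs = [latest.get(s, s) for s in srcs]  # read the current mapping first
        new_dst = None
        if dst is not None:
            new_dst = f"p{next_id}"
            next_id += 1
            latest[dst] = new_dst                    # later readers see the new name
        renamed.append((new_dst, new_srcs))
    return renamed


war_before = count_war(two_iterations)
war_after = count_war(rename(two_iterations))
print(war_before, war_after)
```

Because every destination gets a name that no earlier instruction could have read, the renamed program has no antidependencies left, while each source still refers to the value produced by its most recent writer.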


Last modified: Mon Oct 14 10:31:04 MDT 2002