CS 473 - HW4

More MIPS

Due Monday, October 7, 2002

Revision 1

Consider the following block of C code, which generates a histogram:


for (i = 0; i < 100; i++)
    histo[awry[i]] = histo[awry[i]]+1;

Make the following assumptions:

  1. i is a local variable declared as int, stored in register $t0
  2. awry and histo are both global variables, declared as int[]. Following standard C semantics, you don't need to do bounds checking on the arrays. You can assume you can use awry and histo as symbols.
  3. Assume the "standard" MIPS pipeline (as shown on page 499), additionally assuming any extra data paths needed for instructions that this pipeline can't handle. Also, assume no delayed loads or branches. The main points here are that
    1. you have a five-stage pipeline
    2. a taken branch requires a 1-cycle stall
    3. a lw, in which the loaded value is used for arithmetic in the immediately following instruction, also requires a one-cycle stall.
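    The load-use rule above can be checked mechanically. What follows is only a rough sketch of that one rule, not a pipeline simulator: the (opcode, destination, sources) tuple format and the function names are made up for illustration, pipeline fill time is ignored, and any use of a loaded value by the next instruction is counted as a stall.

```python
# Each instruction is (opcode, destination register, list of source registers).
# This tuple format is an assumption made for the sketch, not assembler syntax.

def count_load_use_stalls(instrs):
    """Count lw results that are consumed by the immediately following instruction."""
    stalls = 0
    for prev, curr in zip(instrs, instrs[1:]):
        opcode, dest, _ = prev
        if opcode == "lw" and dest in curr[2]:
            stalls += 1
    return stalls


def straight_line_cycles(instrs):
    """Cycles for a branch-free sequence: one per instruction plus load-use stalls."""
    return len(instrs) + count_load_use_stalls(instrs)


# One unscheduled histogram update: load awry[i], scale, load histo, add, store.
update = [
    ("lw",   "$1", ["$0"]),
    ("sll",  "$1", ["$1"]),
    ("lw",   "$2", ["$1"]),
    ("addi", "$2", ["$2"]),
    ("sw",   None, ["$2", "$1"]),
]
print(count_load_use_stalls(update), straight_line_cycles(update))
```

    Taken branches are not modeled here; under the assumptions above each one adds one more cycle.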

On to the problems:

  1. (20 points) "Naively" compile the code sequence shown into MIPS assembly code. By "naively," I mean perform a straightforward translation, without thinking about optimizing the code. How large (measured in bytes) is your resulting code? How many cycles does it take to execute, given the assumptions?

    Here's my naive code. If I'd really worked at emulating gcc, I could have made it quite a bit worse:

          
            addi    $t0, $0, 0      ;  i = 0
    loop:   slti    $3, $t0, 100    ;  i < 100?
            beq     $0, $3, eloop   ;  exit loop?
            sll     $5, $t0, 2      ;  convert i to index
            lw      $6, awry($5)    ;  get awry[i]
            sll     $6, $6, 2       ;  convert awry[i] to index
            lw      $7, histo($6)   ;  get histo[awry[i]]
            addi    $7, $7, 1       ;  increment
            sw      $7, histo($6)   ;  put back
            addi    $t0, $t0, 1     ;  increment i
            beq     $0, $0, loop    ;  back to top of loop
    eloop:
          
          

    There are 11 instructions, so it's 44 bytes.

    The first instruction is only executed once, and costs one cycle.

    The instructions in the loop are executed 100 times; there are ten instructions in the loop. The two lw's each cause a stall, as does the beq at the end. On the 101st iteration, the slti and beq at the top are executed; the beq at the top costs a stall.

    So the total time to execute is
    Initialize      1
    Loop            100*(10 + 3)
    Last Pass       3
    Total           1304
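    The same tally can be written out as a few lines of arithmetic. This is just a sketch; the function and parameter names are invented for illustration, and the defaults are the counts from the naive listing.

```python
INSTRUCTION_BYTES = 4  # every MIPS instruction is one 32-bit word


def naive_cost(iterations=100, loop_instrs=10, stalls_per_iter=3,
               setup_instrs=1, exit_instrs=2, exit_stalls=1):
    """Return (size in bytes, cycles) for the naive loop.

    stalls_per_iter: two load-use stalls plus the taken beq at the bottom.
    exit_instrs/exit_stalls: the final slti and the taken beq at the top.
    """
    size = (setup_instrs + loop_instrs) * INSTRUCTION_BYTES
    cycles = (setup_instrs
              + iterations * (loop_instrs + stalls_per_iter)
              + exit_instrs + exit_stalls)
    return size, cycles


print(naive_cost())
```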

  2. (20 points) Optimize your code from the previous question for minimum execution time: do absolutely anything you can think of to make the code run as quickly as possible. Continue to assume no delayed loads or branches. Now how large is it? How many cycles does it take? The resulting optimized code is typically very, very different from a naive compilation.

    The key to the second version is to evaluate what a typical loop iteration will look like. I get the following:

          
            lw      $1, loc($0)     ; get awry[i]
        
            sll     $1, $1, 2       ; convert awry[i] to index
            lw      $2, histo($1)   ; get histo[awry[i]]
        
            addi    $2, $2, 1       ; increment histo[awry[i]]
            sw      $2, histo($1)   ; store histo[awry[i]] back
          
          

    where loc is the location awry[i] and the blank lines represent stalls. This tells us, by the way, that the total code size is going to be 5*4*100 = 2000 bytes. Now, we need to see how to interleave this iteration with the prior and following iterations to get rid of the stalls, but without introducing bugs in the histo increment. We can do this by putting the lw for awry[i+1] just after the lw for histo[awry[i]], like this:

          
            lw      $1, loc($0)     ; get awry[i]
            addi    $2, $2, 1       ; increment histo[awry[i-1]]
            sw      $2, histo($3)   ; store histo[awry[i-1]] back (index still in $3)
            sll     $1, $1, 2       ; convert awry[i] to index
            lw      $2, histo($1)   ; get histo[awry[i]]
            lw      $3, loc+4($0)   ; get awry[i+1]
            addi    $2, $2, 1       ; increment histo[awry[i]]
            sw      $2, histo($1)   ; store histo[awry[i]] back
          
          

    OK, so once we're really under way, we can execute in five cycles per iteration. But can we achieve the steady state without stalls, and can we finish up at the end without stalls? For the startup, we can:

          
            lw      $1, awry($0)    ; get awry[0]
            lw      $3, 4+awry($0)  ; get awry[1]
            sll     $1, $1, 2       ; convert awry[0] to index
            lw      $2, histo($1)   ; get histo[awry[0]]
            sll     $3, $3, 2       ; convert awry[1] to index
            addi    $2, $2, 1       ; increment histo[awry[0]]
            sw      $2, histo($1)   ; store histo[awry[0]] back
            lw      $2, histo($3)   ; get histo[awry[1]]
            lw      $1, 8+awry($0)  ; get awry[2]
            addi    $2, $2, 1       ; increment histo[awry[1]]
            sw      $2, histo($3)   ; store histo[awry[1]] back
          
          

    At this point we've established the pattern we need for the steady-state condition.

    The only catch is on the last iteration: there is no "next iteration" to supply an instruction for the slot after the final lw, so we're stuck with a 1-cycle stall. The 500 instructions plus that one stall give a total of 501 cycles.

    It occurred to me after the fact that it might not have been obvious that you can use address expressions like 4+awry. That's a reasonable assumption: I don't know whether SPIM actually allows it, but a compiler generating the code can certainly produce whatever is needed to get the same effect. If you don't make that assumption, you have to use another register as an index. That register has to be initialized before the first iteration (4 bytes, 1 cycle) and incremented on every iteration but the last (396 bytes, 99 cycles). In that case the total code size is 2400 bytes and the total time is 601 cycles.
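    Both variants of the fully unrolled version can be tallied the same way. Again this is only a sketch with made-up names; it assumes the schedule hides every stall except the one after the final lw.

```python
INSTRUCTION_BYTES = 4


def unrolled_cost(iterations=100, instrs_per_iter=5, use_index_register=False):
    """Return (size in bytes, cycles) for the fully unrolled, interleaved code."""
    instrs = iterations * instrs_per_iter
    if use_index_register:
        # One initialization plus an increment on every iteration but the last.
        instrs += 1 + (iterations - 1)
    stalls = 1  # the last lw of histo has no following iteration to overlap with
    return instrs * INSTRUCTION_BYTES, instrs + stalls


print(unrolled_cost())                         # constant offsets such as 4+awry
print(unrolled_cost(use_index_register=True))  # separate index register
```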

  3. (20 points) Now optimize your code for minimum size. This time, do absolutely anything you can think of to make the code as small as possible. How large is it? How fast is it? The resulting optimized code is typically very similar to a naive compilation.

    The smallest I could think of here involved running the loop backwards, putting the test at the bottom, and using an index directly instead of i. So it looks like this:

          
            addi    $t0, $0, 396    ;  i = 99;
    loop:   lw      $6, awry($t0)   ;  get awry[i]
            sll     $6, $6, 2       ;  convert awry[i] to index
            lw      $7, histo($6)   ;  get histo[awry[i]]
            addi    $t0, $t0, -4    ;  decrement i
            addi    $7, $7, 1       ;  increment histo
            sw      $7, histo($6)   ;  put back
            bgez    $t0, loop       ;  back to top of loop
          
          

    (I was able to speed it up a smidgeon without changing the size by moving the decrement, so I did.) This code has eight instructions, so the total size is 32 bytes. The initial addi costs one cycle; the seven instructions in the loop are executed 100 times for 700 cycles; there is one load stall per iteration for another 100 cycles; and there is a branch stall on every iteration but the last for another 99 cycles. So I get 900 cycles. Notice that for a simple pipelined machine like this, minimizing code size is a good (though not perfect) heuristic for minimizing time; when you get to a superscalar machine, the pairing rules make it a much weaker heuristic.
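    To convince yourself that counting a byte offset down from 396 with the test at the bottom visits the same 100 elements as the original loop, here is a small model of both loops. It is a sketch only: the function names are invented, and Python lists stand in for the word arrays.

```python
def histogram_forward(awry, histo_size):
    """The original C loop: i runs from 0 to 99."""
    histo = [0] * histo_size
    for i in range(100):
        histo[awry[i]] += 1
    return histo


def histogram_backward_offsets(awry, histo_size):
    """The minimum-size loop: a byte offset runs from 396 down, test at the bottom."""
    histo = [0] * histo_size
    offset = 396                   # addi $t0, $0, 396
    iterations = 0
    while True:
        value = awry[offset // 4]  # lw $6, awry($t0)
        offset -= 4                # addi $t0, $t0, -4
        histo[value] += 1          # lw / addi / sw on histo
        iterations += 1
        if offset < 0:             # bgez $t0, loop falls through
            break
    return histo, iterations


sample = [(7 * k) % 10 for k in range(100)]  # arbitrary test data
print(histogram_backward_offsets(sample, 10) == (histogram_forward(sample, 10), 100))
```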

  4. (30 points) Finally, optimize your code for a compromise between speed and size on a dual-pipeline superscalar MIPS, as shown on pages 511 through 514. Also as on page 513, you should assume a delayed branch this time. How big is your code? How fast?

    I'm going to see how close I can come to unrolling enough to get both pipelines in use all the time, and call that a good compromise. I'm also going to write it out in two columns so it'll be easier for me to keep track of what's going down which pipeline.

    First, let's just take our minimum-size single-pipeline version and send it down the two pipes so we can see what we're looking at. One thing to notice is that the figure on page 512 doesn't show any forwarding paths, but the scheduling done by the authors pretty clearly assumes at least the forwarding we've been using to date. Given that, I think I can assume a forwarding path that allows the result of an instruction in the left pipe to be written to memory by its partner in the right pipe (but you can't use it in an address calculation in the right pipe).

          
            addi    $t0, $0, 396    :  nop                  ; i = 99;
    loop:   nop                     :  lw   $6, awry($t0)   ; get awry[i]
            sll     $6, $6, 2       :  nop                  ; convert to index
            nop                     :  lw   $7, histo($6)   ; get histo[awry[i]]
            addi    $t0, $t0, -4    :  nop                  ; decrement i
            addi    $7, $7, 1       :  sw   $7, histo($6)   ; increment histo and put back
            bgez    $t0, loop       :  nop                  ; loop
          
          

    Wow. Code that is pretty well optimized for a single pipeline, and makes virtually no use of the dual pipes! The size is 56 bytes, the time is 601. Still faster than the single-pipe version (it could be argued that I cheated by 99 cycles because of the psychic branch prediction, but we still end up at 700, which is still faster), but how much better can we do? Hmmmm, looking at the right pipe, it's used for three instructions and idle for three; to me, this looks like a good candidate for unrolling by a factor of two. Let's see what happens:

          
            addi    $t0, $0, 396    :  nop                  ; i = 99;
    loop:   nop                     :  lw   $6, awry($t0)   ; get awry[i]
            sll     $6, $6, 2       :  lw   $1, awry-4($t0) ; convert awry[i] to index; get awry[i-1]
            nop                     :  lw   $7, histo($6)   ; get histo[awry[i]]
            sll     $1, $1, 2       :  nop                  ; convert awry[i-1] to index
            addi    $7, $7, 1       :  sw   $7, histo($6)   ; increment histo and put back
            addi    $t0, $t0, -8    :  lw   $7, histo($1)   ; decrement i; get histo[awry[i-1]]
            nop                     :  nop                  ; now that's ugly...
            addi    $7, $7, 1       :  sw   $7, histo($1)   ; increment histo and put back
            bgez    $t0, loop       :  nop                  ; loop
          
          

    That looks quite a bit better... the code is only slightly larger at 80 but the time is reduced to 451. What happens if we unroll by four?

          
            addi    $t0, $0, 396    :  nop                  ; i = 99;
    loop:   nop                     :  lw   $6, awry($t0)   ; get awry[i]
            sll     $6, $6, 2       :  lw   $1, awry-4($t0) ; convert awry[i] to index; get awry[i-1]
            nop                     :  lw   $7, histo($6)   ; get histo[awry[i]]
            sll     $1, $1, 2       :  lw   $2, awry-8($t0) ; convert awry[i-1] to index; get awry[i-2]
            addi    $7, $7, 1       :  sw   $7, histo($6)   ; increment histo and put back
            sll     $2, $2, 2       :  lw   $7, histo($1)   ; convert awry[i-2] to index; get histo[awry[i-1]]
            addi    $t0, $t0, -16   :  lw   $3, awry-12($t0); decrement i and get awry[i-3]
            addi    $7, $7, 1       :  sw   $7, histo($1)   ; increment histo and put back
            sll     $3, $3, 2       :  lw   $7, histo($2)   ; convert awry[i-3] to index; get histo[awry[i-2]]
            nop                     :  nop                  ; sigh
            addi    $7, $7, 1       :  sw   $7, histo($2)   ; increment histo and put back
            nop                     :  lw   $7, histo($3)   ; get histo[awry[i-3]]
            nop                     :  nop                  ; sigh
            addi    $7, $7, 1       :  sw   $7, histo($3)   ; put back
            bgez    $t0, loop       :  nop                  ; loop
          
          

    A couple of notes: I deliberately put the decrement of i on the same cycle as the last load from awry, even though I could have placed it later, as a reminder that the text changes the semantics of the MIPS a bit for the superscalar version; the lw will see the old version of $t0, not the new version. Also, I don't see a way around the pattern at the end (nop lw nop nop addi sw) without a possible problem on updating histo if two adjacent elements of awry have the same value. If I wanted to, I could unroll by eight or sixteen; I'd have to devote some more thought to whether this would actually end up using the pipelines more efficiently. At any rate, since 100 is divisible by four but not by eight, this would require some cleanup code to take care of the last few iterations, so unrolling by four seems to be a good compromise.
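    The problem with adjacent equal elements is easy to see in a small model. The second function below is a hypothetical reordering, not one of the schedules above; it hoists both histo loads above both stores, which is what you would need to do to fill those nops.

```python
def update_in_order(histo, a, b):
    """Load, increment, store for bucket a, then the same for bucket b."""
    h = list(histo)
    t = h[a]
    h[a] = t + 1
    t = h[b]
    h[b] = t + 1
    return h


def update_loads_hoisted(histo, a, b):
    """Hypothetical reordering: both loads happen before either store."""
    h = list(histo)
    ta = h[a]
    tb = h[b]      # stale if b == a, because the first store hasn't happened yet
    h[a] = ta + 1
    h[b] = tb + 1
    return h


print(update_in_order([0, 0, 0], 1, 1))       # both increments land
print(update_loads_hoisted([0, 0, 0], 1, 1))  # one increment is lost
```

    When the two buckets differ, both orderings give the same result, which is why the bug would only show up on some inputs.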

    So I get a size of 128 and a time of 376.
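    The three dual-issue versions can be compared with one small formula. This sketch uses invented names and counts the way the figures above were counted: each line of a listing is one 8-byte issue packet taking one cycle, with no extra charge for the branch.

```python
PACKET_BYTES = 8  # two 4-byte instructions issue together


def superscalar_cost(loop_packets, unroll, elements=100, setup_packets=1):
    """Return (size in bytes, cycles) for a dual-issue loop unrolled `unroll` times."""
    iterations = elements // unroll
    size = (setup_packets + loop_packets) * PACKET_BYTES
    cycles = setup_packets + loop_packets * iterations
    return size, cycles


# (packets in the loop body, unroll factor) read off the three listings
for loop_packets, unroll in [(6, 1), (9, 2), (15, 4)]:
    print(unroll, superscalar_cost(loop_packets, unroll))
```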

  5. You can use SPIM to debug your code; to do that, you'll need to put in an extra loop to initialize awry with values. Unfortunately, SPIM doesn't do instruction cycle counting, so you can't use it to get the cycle counts.


Last modified: Fri Oct 11 09:21:55 MDT 2002