CS 473 - HW4

More MIPS

Due Wednesday, February 26, 2003

Consider the following block of C code, which looks up 100 items from an array in a table and computes a checksum based on the items found there (this is actually a very practical problem: I'm having to do it for some barcode scanning software I'm writing for Science Fair).


checksum = 0;
for (i = 0; i < 100; i++)
    checksum = (checksum + table[string[i]]) % 256;

Make the following assumptions:

  1. i and checksum are local variables declared as int, stored in registers $t0 and $t1, respectively.
  2. string and table are both global variables. string is declared as char[], while table is declared as int[]. Following standard C semantics, you don't need to do bounds checking on the arrays. Unfortunately, the addresses of both string and table are too large to fit in 16 bits, so you'll have to construct their addresses by hand. To simplify things a bit, assume you can say high(symbol) to mean the high-order 16 bits of a symbol, and low(symbol) to mean the low-order 16 bits of the symbol. You can do simple arithmetic involving constants only, and use that where a symbol would go (so you could say something like low(sym+1), but not low(sym+i), as i is a variable).
  3. Assume the "standard" MIPS pipeline (as shown on page 499), additionally assuming any extra data paths needed for instructions that this pipeline can't handle. Also, assume no delayed loads and branches. The main points here are that
    1. you have a five-stage pipeline
    2. a taken branch requires a 1-cycle stall
    3. a lw, in which the loaded value is used for arithmetic in the immediately following instruction, also requires a one-cycle stall.
  4. Use only "real" MIPS instructions, no pseudo-instructions: so, for instance, you can't use the rem (remainder) pseudo-instruction.
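For checking any hand translation against the original loop, it can help to have the C code as a self-contained function. The following is a minimal sketch, not part of the assignment: the function names and the explicit parameters are assumptions for illustration (the assignment uses globals), string is taken as unsigned char to match what an lbu would load, and the table entries are assumed to be non-negative. Under that last assumption, % 256 and a bitwise AND with 255 give the same result, which is why an andi can stand in for the remainder.

```c
#include <stddef.h>

/* Reference version of the loop from the assignment.
 * Names and parameters are illustrative; the assignment uses globals. */
int checksum_ref(const unsigned char *string, const int *table, size_t n)
{
    int checksum = 0;
    for (size_t i = 0; i < n; i++)
        checksum = (checksum + table[string[i]]) % 256;
    return checksum;
}

/* Same loop with the modulus written as a mask, the way andi would do it.
 * Assumes every table entry is non-negative. */
int checksum_mask(const unsigned char *string, const int *table, size_t n)
{
    int checksum = 0;
    for (size_t i = 0; i < n; i++)
        checksum = (checksum + table[string[i]]) & 255;
    return checksum;
}
```

Calling both functions with n = 100 on the same string and table should return the same value as long as the non-negativity assumption holds.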

On to the problems. Note that I want exact answers to all the questions. When I ask for code size, that means only the code; you don't need to count space for variables.

  1. (20 points) "Naively" compile the code sequence shown into MIPS assembly code. By "naively," I mean perform a straightforward translation, without thinking about optimizing the code. How large (measured in bytes) is your resulting code? How many cycles does it take to execute, given the assumptions?
    This solution tries not to be willfully dumb, but is pretty much top-of-my-head code.
          add    $t1, $zero, $zero   # checksum = 0
          add    $t0, $zero, $zero   # i = 0
          
    loop  lui    $t5, high(string)
          addi   $t5, $t5, low(string)
          add    $t5, $t5, $t0
          lbu    $t5, 0($t5)         # get string[i]
          
          sll    $t5, $t5, 2         # multiply by 4 to do table lookup
          lui    $t6, high(table)
          addi   $t6, $t6, low(table)
          add    $t6, $t5, $t6
          lw     $t6, 0($t6)         # get table[string[i]]
          
          add    $t1, $t1, $t6
          andi   $t1, $t1, 255       # mod 256
          
          addi   $t0, $t0, 1         # i = i+1
          slti   $t2, $t0, 100
          bne    $t2, $zero, loop      # go around again.
    
          
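    There are 16 instructions, so it's 64 bytes. Execution takes 4 cycles to load the pipeline and 2 for the instructions before the loop. The loop is executed 100 times; each iteration takes 14 cycles for instructions plus 3 for stalls: one after the lbu (its value is used by the sll that follows), one after the lw (its value is used by the add that follows), and one for the taken branch. The branch is not taken on the last iteration, so that iteration has only 2 stalls. The total is 4 + 2 + 100(17) - 1 = 1705 cycles.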
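The cycle arithmetic above can be sketched as a small C function. This is only an illustration of the counting rule under the stated pipeline assumptions; the function and parameter names are hypothetical, and it covers the simple case of straight-line setup code followed by one loop that ends in a taken branch on every iteration but the last.

```c
/* Cycle count for a single-issue pipeline under the stated assumptions.
 * All names here are illustrative, not from the assignment. */
long loop_cycles(long fill, long before, long iterations,
                 long body_instructions, long load_stalls_per_iteration)
{
    /* Every iteration but the last ends in a taken branch (1 stall each). */
    long taken_branch_stalls = iterations - 1;

    return fill + before
         + iterations * (body_instructions + load_stalls_per_iteration)
         + taken_branch_stalls;
}
```

With the numbers from this solution, loop_cycles(4, 2, 100, 14, 2) works out to 4 + 2 + 1600 + 99 = 1705.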

    Here's another solution that optimizes a bit, but doesn't get into loop unrolling and so forth.

          add    $t1, $zero, $zero     # checksum = 0
          
          lui    $t5, high(string)
          addi   $t5, $t5, low(string) # get a pointer to the string
          addi   $t7, $t5, 100         # get a pointer past the end
    
          lui    $t6, high(table)
          addi   $t6, $t6, low(table)  # get a pointer to the table
          
    loop  lbu    $t2, 0($t5)           # get string[i]
          
          addi   $t5, $t5, 1           # increment the pointer
          sll    $t2, $t2, 2           # multiply by 4 to do table lookup
    
          add    $t3, $t2, $t6
          lw     $t3, 0($t3)           # get table[string[i]]
          slt    $t4, $t5, $t7         # check to see if this is the last time
    
          add    $t1, $t1, $t3         # add to checksum
    
          bne    $t4, $zero, loop      # go around again.
    
          andi   $t1, $t1, 255         # mod 256
    
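    This one is only 15 instructions, so it's only 60 bytes. Time is 4 to fill the pipeline, 6 before the loop, 100 iterations of 9 cycles each (8 instructions plus the taken-branch stall, which the last iteration doesn't pay; the loads are scheduled so there are no load-use stalls), and 1 cycle at the end, for a total of 4 + 6 + 100(9) - 1 + 1 = 910 cycles. Optimizations include code hoisting (the address construction and the mod both move out of the loop) and use of a pointer into the array, but not any loop unrolling nor running the loop backward.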
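The two optimizations named above can be written out in C as a rough sketch. The function name and parameters are assumptions for illustration, and the table entries are assumed to be non-negative. The running sum is kept in an unsigned int so that any wraparound is well defined; because 256 divides the unsigned range evenly, the low 8 bits of the sum are the same whether the mask is applied every iteration or once at the end. Note that the MIPS loop tests at the bottom, while this sketch tests at the top so that n = 0 is handled.

```c
#include <stddef.h>

/* Pointer-walking version with the mask applied once after the loop.
 * Illustrative sketch: assumes non-negative table entries. */
int checksum_hoisted(const unsigned char *string, const int *table, size_t n)
{
    const unsigned char *p = string;        /* pointer into the string */
    const unsigned char *end = string + n;  /* pointer past the end    */
    unsigned int sum = 0;

    while (p < end)
        sum += (unsigned int)table[*p++];   /* no mask inside the loop */

    return (int)(sum & 255u);               /* mod 256, done once      */
}
```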

  2. (20 points) Optimize your code from the previous question for minimum execution time: do absolutely anything you can think of to make the code run as quickly as possible; this typically involves eliminating the loop. Now how large is it? How many cycles does it take?
    The fact that we know how many iterations we'll be taking lets us completely unroll the loop.
          add   $t1, $zero, $zero     # initialize checksum
    
          lui   $t2, high(string)     # get pointer to string
          addi  $t2, $t2, low(string)
    
          lui   $t3, high(table)      # get pointer to table
          addi  $t3, $t3, low(table)
          
          lbu   $t4, 0($t2)           # get string[0]
          lbu   $t5, 1($t2)           # get string[1]
          
          sll   $t4, $t4, 2           # shift string[0] for table lookup
          add   $t4, $t3, $t4         # get address into table
          lw    $t4, 0($t4)           # get table[string[0]]
          
          sll   $t5, $t5, 2           # shift string[1] for table lookup
          add   $t5, $t3, $t5         # get address into table
          lw    $t5, 0($t5)           # get table[string[1]]
    
          add   $t1, $t1, $t4         # checksum
          add   $t1, $t1, $t5         # checksum
    
          # the preceding 10 instructions are repeated 49 more times, each time
          # advancing the offsets on the lbu instructions by 2
    
          andi  $t1, $t1, 255       # modulus
          
    So... this time we have 5 + 50*10 + 1 = 506 instructions, taking 2024 bytes. There are no branches, and no loaded value is used by the instruction right after its load, so there are no stalls: the total time is 4 to fill the pipe + 506 = 510 cycles. I'd say we've hit some sort of diminishing returns between my second "unoptimized" solution and this solution!
  3. (20 points) Now optimize your code for minimum size. This time, do absolutely anything you can think of to make the code as small as possible. How large is it? How fast is it? The resulting optimized code is typically very similar to a naive compilation.
    Hmmm.... I'm not getting any brainstorms to shrink it past the second "unoptimized" solution...
  4. (30 points) Finally, optimize your code for a compromise between speed and size on a dual-pipeline superscalar MIPS, as shown on pages 511 through 514 (use the book version, not my lecture version, since the book version is easier to look up details on). Also as on page 513, you should assume a delayed branch this time. How big is your code? How fast? Note that this last one won't have a unique correct answer; as you found in the previous parts of the problem, you can trade off space vs. time, and I've deliberately not told you exactly how to trade (all the same, a minimum-size answer will likely lose points for being slow, and a minimum-time answer will likely lose points for being big).
    I'll write this down so we can see what's going down which pipeline. Hmmm.... just to give another example of different approaches, I'll unroll two iterations of the loop. Since I'm using pointers into the string array, running the array backwards won't help.
          add  $t1, $zero, $zero                                  # init checksum
    
          lui   $t4, high(string)                                 # pointer to string
          addi  $t4, $t4, low(string)
    
          lui   $t3, high(table)                                  # get pointer to table
          addi  $t3, $t3, low(table)
    
          addi  $t2, $t4, 100                                     # pointer past end of string
    	  
          addi  $t7, $zero, 0                                     # so we can add them on the first iteration
          addi  $t8, $zero, 0
    
    loop  add   $t1, $t1, $t7         lbu   $t5, 0($t4)           # checksum from last iteration; get string[i]
          add   $t1, $t1, $t8         lbu   $t6, 1($t4)           # checksum from last iteration; get string[i+1]
          sll   $t5, $t5, 2
          add   $t5, $t3, $t5
          sll   $t6, $t6, 2           lw    $t7, 0($t5)           # table[string[i]]
          add   $t6, $t3, $t6
          addi  $t4, $t4,  2          lw    $t8, 0($t6)           # table[string[i+1]]
          slt   $t9, $t4, $t2
          bne   $t9, $zero, loop
    
          add   $t1, $t1, $t7                                     # checksum from last iteration
          add   $t1, $t1, $t8
    
          andi  $t1, $t1, 255                                     # modulus
          
    Assuming we don't actually need to put no-ops in the code to force the instructions into the correct pipelines, this is 24 instructions or 96 bytes. Filling the pipe takes 4 cycles, then 8 cycles for the preamble, (50 * 10) - 1 = 499 for the loop (remember, there is a one-cycle stall every iteration but the last), and three cycles for the cleanup code at the end, for a total of 4 + 8 + 499 + 3 = 514 cycles. That's within a few cycles of the 4 + 506 = 510 I got for my full-speed non-superscalar solution, in less than a twentieth of the space. How much better do you suppose it could be if either pipe could do arithmetic?
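The scheduling idea in this loop (load a pair of table values in one iteration, add them to the checksum at the top of the next) can be sketched in C. This is only an illustration of the structure, not a model of the two pipelines; the function name and parameters are hypothetical, n is assumed to be even, and the table entries are assumed to be non-negative. As before, the sum is kept unsigned so that masking once at the end is safe.

```c
#include <stddef.h>

/* Two-way unrolled loop in which the values loaded in one iteration are
 * added at the top of the next, mirroring the schedule above.
 * Illustrative sketch: assumes n is even and table entries are non-negative. */
int checksum_pipelined(const unsigned char *string, const int *table, size_t n)
{
    const unsigned char *p = string;        /* pointer into the string */
    const unsigned char *end = string + n;  /* pointer past the end    */
    unsigned int sum = 0;
    unsigned int v0 = 0, v1 = 0;  /* zero so the first two adds are harmless */

    while (p < end) {
        sum += v0;                /* checksum from the previous iteration */
        sum += v1;
        v0 = (unsigned int)table[p[0]];   /* table[string[i]]   */
        v1 = (unsigned int)table[p[1]];   /* table[string[i+1]] */
        p += 2;
    }

    sum += v0;                    /* drain the last pair */
    sum += v1;
    return (int)(sum & 255u);     /* modulus */
}
```

The two zero-initialized temporaries play the role of $t7 and $t8, and the two adds after the loop correspond to the cleanup code.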

Last modified: Thu Feb 27 14:28:56 MST 2003