Consider the following block of C code, which looks up 100 items from an array in a table and computes a checksum based on the items found there (this is actually a very practical problem: I'm having to do it for some barcode scanning software I'm writing for Science Fair).
checksum = 0;
for (i = 0; i < 100; i++)
    checksum = (checksum + table[string[i]]) % 256;
Make the following assumptions:
i and checksum are local variables declared as int, stored in registers $t0 and $t1, respectively.
string and table are both global variables. string is declared as char[], while table is declared as int[]. Following standard C semantics, you don't need to do bounds checking on the arrays. Unfortunately, the addresses of both string and table are too large to fit in 16 bits, so you'll have to construct their addresses by hand. To simplify things a bit, assume you can say high(symbol) to mean the high-order 16 bits of a symbol, and low(symbol) to mean the low-order 16 bits of the symbol. You can do simple arithmetic involving constants only, and use that where a symbol would go (so you could say something like low(sym+1), but not low(sym+i), as i is a variable).
lw, in which the loaded value is used for arithmetic in the immediately following instruction, also requires a one-cycle stall.
rem (remainder) pseudo-instruction.
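The high(symbol)/low(symbol) split can be sketched in C as below. This is only an illustration: the helper names (high16, low16, rebuild_with_ori, rebuild_with_addi) and the two addresses are made up, and high16 is assumed to be a plain shift with no adjustment. The sketch relies on the standard MIPS behavior that ori zero-extends its 16-bit immediate while addi sign-extends it, so the two ways of rebuilding an address agree only when bit 15 of the low half is clear.

```c
#include <assert.h>
#include <stdint.h>

/* Plain 16-bit halves of a 32-bit address (assumed meaning of high/low). */
static uint32_t high16(uint32_t addr) { return addr >> 16; }
static uint32_t low16(uint32_t addr)  { return addr & 0xFFFFu; }

/* lui + ori: ori zero-extends its 16-bit immediate. */
static uint32_t rebuild_with_ori(uint32_t hi, uint32_t lo)
{
    return (hi << 16) | lo;
}

/* lui + addi: addi sign-extends its 16-bit immediate before adding. */
static uint32_t rebuild_with_addi(uint32_t hi, uint32_t lo)
{
    uint32_t extended = (lo & 0x8000u) ? (lo | 0xFFFF0000u) : lo;
    return (hi << 16) + extended;   /* unsigned add wraps modulo 2^32 */
}

/* Usage sketch with two made-up addresses. */
static void demo_address_split(void)
{
    uint32_t a = 0x10012345u;       /* bit 15 of the low half is clear */
    uint32_t b = 0x1001ABCDu;       /* bit 15 of the low half is set   */

    assert(rebuild_with_ori(high16(a), low16(a)) == a);
    assert(rebuild_with_addi(high16(a), low16(a)) == a);
    assert(rebuild_with_ori(high16(b), low16(b)) == b);
    /* With addi, the sign-extended low half effectively subtracts 0x10000. */
    assert(rebuild_with_addi(high16(b), low16(b)) == b - 0x10000u);
}
```

If the low half can have bit 15 set, either ori or a high half that has been bumped by one to compensate would be needed; the problem statement doesn't say which applies here.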
On to the problems. Note that I want exact answers to all the questions. When I ask for code size, that means only the code; you don't need to count space for variables.
        add  $t1, $zero, $zero     # checksum = 0
        add  $t0, $zero, $zero     # i = 0
loop:   lui  $t5, high(string)
        addi $t5, $t5, low(string)
        add  $t5, $t5, $t0
        lbu  $t5, 0($t5)           # get string[i]
        sll  $t5, $t5, 2           # multiply by 4 to do table lookup
        lui  $t6, high(table)
        addi $t6, $t6, low(table)
        add  $t6, $t5, $t6
        lw   $t6, 0($t6)           # get table[string[i]]
        add  $t1, $t1, $t6
        andi $t1, $t1, 255         # mod 256
        addi $t0, $t0, 1           # i = i+1
        slti $t2, $t0, 100
        bne  $t2, $zero, loop      # go around again.
There are 16 instructions, so the code is 64 bytes. Execution
takes 4 cycles to fill the pipeline and 2 for the instructions
before the loop. The loop is executed 100 times; each iteration
takes 14 cycles for instructions plus 3 for stalls (the last
iteration has one stall fewer). The total is 4 + 2 + 100(17) - 1 = 1705 cycles.
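The cycle count above follows a pattern that can be written down as a small C helper. This is just a sketch of the arithmetic: loop_cycles and its parameter names are hypothetical, and the numbers plugged in are the ones stated in the answer.

```c
#include <assert.h>

/* Hypothetical helper: cycles for a pipeline fill, a straight-line
   prologue, and a loop in which a few stalls are skipped on the
   final iteration. */
static long loop_cycles(long fill, long prologue, long iterations,
                        long instrs_per_iter, long stalls_per_iter,
                        long stalls_skipped_on_last)
{
    assert(iterations > 0);
    return fill + prologue
         + iterations * (instrs_per_iter + stalls_per_iter)
         - stalls_skipped_on_last;
}

/* Usage sketch: the first solution's numbers. */
static void demo_first_solution(void)
{
    long instructions = 2 + 14;                        /* prologue + loop body */
    assert(instructions * 4 == 64);                    /* bytes of code        */
    assert(loop_cycles(4, 2, 100, 14, 3, 1) == 1705);  /* total cycles         */
}
```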
Here's another solution that optimizes a bit, but doesn't get into loop unrolling and so forth.
        add  $t1, $zero, $zero     # checksum = 0
        lui  $t5, high(string)
        addi $t5, $t5, low(string) # get a pointer to the string
        addi $t7, $t5, 100         # get a pointer past the end
        lui  $t6, high(table)
        addi $t6, $t6, low(table)  # get a pointer to the table
loop:   lbu  $t2, 0($t5)           # get string[i]
        addi $t5, $t5, 1           # increment the pointer
        sll  $t2, $t2, 2           # multiply by 4 to do table lookup
        add  $t3, $t2, $t6
        lw   $t3, 0($t3)           # get table[string[i]]
        slt  $t4, $t5, $t7         # check to see if this is the last time
        add  $t1, $t1, $t3         # add to checksum
        bne  $t4, $zero, loop      # go around again.
        andi $t1, $t1, 255         # mod 256, done once at the end
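This one is only 15 instructions, so it's only 60 bytes. Time is
4 cycles to fill the pipeline, 6 before the loop, 100 iterations of
9 cycles each (8 instructions plus a one-cycle stall, which the last
iteration skips just as in the first solution), and 1 cycle at the
end, for a total of 4 + 6 + 100(9) - 1 + 1 = 910 cycles. Optimizations
include code hoisting, use of a pointer into the array, and doing the
modulus once after the loop, but neither loop unrolling nor running
the loop backward.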
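In C terms, the optimizations in this solution look roughly like the sketch below. It is an illustration rather than a translation of the assembly: the function names are hypothetical, and the table is declared unsigned (an assumption, the problem says int) so that wraparound is well defined. Because 256 divides the unsigned range, reducing once at the end gives the same low 8 bits as reducing on every iteration.

```c
#include <assert.h>

enum { N_ITEMS = 100 };

/* Reference: the loop as given in the problem, reducing mod 256 each time. */
static unsigned checksum_reference(const unsigned char *string,
                                   const unsigned *table)
{
    unsigned checksum = 0;
    for (int i = 0; i < N_ITEMS; i++)
        checksum = (checksum + table[string[i]]) % 256;
    return checksum;
}

/* Optimized shape: addresses computed once, a pointer walked through
   the string, the loop test at the bottom, one reduction at the end. */
static unsigned checksum_hoisted(const unsigned char *string,
                                 const unsigned *table)
{
    const unsigned char *p = string;
    const unsigned char *end = string + N_ITEMS;   /* pointer past the end */
    unsigned sum = 0;

    do {
        sum += table[*p++];
    } while (p < end);

    return sum & 255u;                             /* same as % 256 here */
}

/* Usage sketch with made-up data. */
static void demo_checksums(void)
{
    unsigned char s[N_ITEMS];
    unsigned t[256];

    for (int c = 0; c < 256; c++)
        t[c] = (unsigned)c * 7u + 3u;
    for (int i = 0; i < N_ITEMS; i++)
        s[i] = (unsigned char)((i * 37 + 11) & 255);

    assert(checksum_reference(s, t) == checksum_hoisted(s, t));
}
```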
        add  $t1, $zero, $zero     # initialize checksum
        lui  $t2, high(string)     # get pointer to string
        addi $t2, $t2, low(string)
        lui  $t3, high(table)      # get pointer to table
        addi $t3, $t3, low(table)
        lbu  $t4, 0($t2)           # get string[0]
        lbu  $t5, 1($t2)           # get string[1]
        sll  $t4, $t4, 2           # shift string[0] for table lookup
        add  $t4, $t3, $t4         # get address into table
        lw   $t4, 0($t4)           # get table[string[0]]
        sll  $t5, $t5, 2           # shift string[1] for table lookup
        add  $t5, $t3, $t5         # get address into table
        lw   $t5, 0($t5)           # get table[string[1]]
        add  $t1, $t1, $t4         # checksum
        add  $t1, $t1, $t5         # checksum
        # the preceding 10 lines are repeated 49 more times, each time
        # changing the offsets on the lbu instructions
        andi $t1, $t1, 255         # modulus
So... this time we have 5 + 50*10 + 1 = 506 instructions,
taking 2024 bytes. Since we're executing 506 instructions with no
stalls, the total time is 4 to fill the pipe + 506 = 510 cycles.
I'd say we hit some sort of diminishing returns between my second
"unoptimized" solution and this solution!
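Nobody would type 506 instructions by hand, so here is a rough sketch of a C generator for the unrolled listing. The function names emit and emit_unrolled are hypothetical, and the generator writes the prologue with the three-operand form of addi. It returns the instruction count so the size and time figures above can be checked.

```c
#include <assert.h>
#include <stdio.h>

/* Write one instruction to `out` (skipped when out is NULL); returns 1
   so the caller can keep a running instruction count. */
static int emit(FILE *out, const char *text)
{
    if (out != NULL)
        fprintf(out, "        %s\n", text);
    return 1;
}

/* Emit the fully unrolled checksum code; returns the instruction count. */
static int emit_unrolled(FILE *out)
{
    char line[64];
    int count = 0;

    count += emit(out, "add  $t1, $zero, $zero");
    count += emit(out, "lui  $t2, high(string)");
    count += emit(out, "addi $t2, $t2, low(string)");
    count += emit(out, "lui  $t3, high(table)");
    count += emit(out, "addi $t3, $t3, low(table)");

    /* 50 copies of the 10-instruction body; only the lbu offsets change. */
    for (int i = 0; i < 100; i += 2) {
        snprintf(line, sizeof line, "lbu  $t4, %d($t2)", i);
        count += emit(out, line);
        snprintf(line, sizeof line, "lbu  $t5, %d($t2)", i + 1);
        count += emit(out, line);
        count += emit(out, "sll  $t4, $t4, 2");
        count += emit(out, "add  $t4, $t3, $t4");
        count += emit(out, "lw   $t4, 0($t4)");
        count += emit(out, "sll  $t5, $t5, 2");
        count += emit(out, "add  $t5, $t3, $t5");
        count += emit(out, "lw   $t5, 0($t5)");
        count += emit(out, "add  $t1, $t1, $t4");
        count += emit(out, "add  $t1, $t1, $t5");
    }

    count += emit(out, "andi $t1, $t1, 255");
    return count;
}

/* Usage sketch: count only, then check size and time. */
static void demo_unrolled_counts(void)
{
    int n = emit_unrolled(NULL);
    assert(n == 5 + 50 * 10 + 1);   /* 506 instructions         */
    assert(n * 4 == 2024);          /* bytes of code            */
    assert(4 + n == 510);           /* pipeline fill + no stalls */
}
```

Calling emit_unrolled(stdout) would print the whole listing.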
        add  $t1, $zero, $zero        # init checksum
        lui  $t2, high(string)        # pointer to string
        addi $t2, $t2, low(string)
        lui  $t3, high(table)         # get pointer to table
        addi $t3, $t3, low(table)
        addi $t4, $t2, 100            # pointer past end of string
        addi $t7, $zero, 0            # so we can add them on the first iteration
        addi $t8, $zero, 0
loop:   add  $t1, $t1, $t7      lbu $t5, 0($t2)   # checksum from last iteration; get string[i]
        add  $t1, $t1, $t8      lbu $t6, 1($t2)   # checksum from last iteration; get string[i+1]
        sll  $t5, $t5, 2
        add  $t5, $t3, $t5
        sll  $t6, $t6, 2        lw  $t7, 0($t5)   # table[string[i]]
        add  $t6, $t3, $t6
        addi $t2, $t2, 2        lw  $t8, 0($t6)   # table[string[i+1]]
        slt  $t9, $t2, $t4            # more of the string left?
        bne  $t9, $zero, loop
        add  $t1, $t1, $t7            # checksum from last iteration
        add  $t1, $t1, $t8
        andi $t1, $t1, 255            # modulus
Assuming we don't actually need to put no-ops in the code to force
the instructions into the correct pipelines, this is 24 instructions
or 96 bytes. Filling the pipe takes 4 cycles, then 8 cycles for the
preamble, (50 * 10)-1 = 499 for the loop (remember, there is a
one-cycle stall every iteration but the last), and three cycles for
the cleanup code at the end, for a total of 4 + 8 + 499 + 3 = 514
cycles. It's a pretty remarkable coincidence that this comes out
within a few cycles of my full-speed non-superscalar solution
(4 + 506 = 510). How much better do you suppose it could be if
either pipe could do arithmetic?
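The trick in the superscalar loop is that each iteration adds in the table values loaded by the previous iteration while starting the loads for the current pair, which is why two registers are zeroed before the loop and two adds follow it. A minimal C sketch of that structure is below. The function name is hypothetical, the table is declared unsigned as an assumption so overflow is well defined, and C of course says nothing about which pipe executes what.

```c
#include <assert.h>

enum { N_ITEMS = 100 };

/* Two-way unrolled loop in which the adds lag one iteration behind
   the loads, mirroring the structure of the assembly above. */
static unsigned checksum_pipelined(const unsigned char *string,
                                   const unsigned *table)
{
    const unsigned char *p = string;
    const unsigned char *end = string + N_ITEMS;
    unsigned sum = 0;
    unsigned prev0 = 0, prev1 = 0;   /* zero, so the first adds are harmless */

    do {
        sum += prev0;                /* finish the previous pair */
        sum += prev1;
        prev0 = table[p[0]];         /* start the current pair   */
        prev1 = table[p[1]];
        p += 2;
    } while (p < end);

    sum += prev0;                    /* drain the last pair      */
    sum += prev1;
    return sum & 255u;
}

/* Usage sketch: compare against the loop from the problem statement. */
static void demo_pipelined(void)
{
    unsigned char s[N_ITEMS];
    unsigned t[256];
    unsigned ref = 0;

    for (int c = 0; c < 256; c++)
        t[c] = (unsigned)c * 13u + 5u;
    for (int i = 0; i < N_ITEMS; i++)
        s[i] = (unsigned char)((i * 91 + 3) & 255);
    for (int i = 0; i < N_ITEMS; i++)
        ref = (ref + t[s[i]]) % 256;

    assert(checksum_pipelined(s, t) == ref);
}
```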