Consider the following block of C code, which generates a histogram:
for (i = 0; i < 100; i++) histo[awry[i]] = histo[awry[i]]+1;
Make the following assumptions:
i is a local variable declared as int, stored in register $t0.
awry and histo are both global variables, declared as int[]. Following standard C
semantics, you don't need to do bounds checking on the arrays.
You can assume you can use awry and histo as symbols.
An lw whose loaded value is used for arithmetic in the immediately following
instruction also requires a one-cycle stall.
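For reference, the C fragment can be wrapped in a small function so it compiles on its own. This is only a sketch: the name build_histogram and the explicit array and length parameters are assumptions for illustration; the problem itself treats awry and histo as globals and fixes the count at 100.

```c
#include <assert.h>
#include <stddef.h>

/* Count how often each value occurs in awry[0..n-1].
 * Assumption (not checked, as in the problem statement): every awry[i]
 * is a valid index into histo. */
void build_histogram(const int *awry, size_t n, int *histo)
{
    assert(awry != NULL && histo != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Same statement as the loop body in the problem. */
        histo[awry[i]] = histo[awry[i]] + 1;
    }
}
```

Every version of the assembly has to leave histo in the same state as this loop does.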
On to the problems:
Here's my naive code. If I'd really worked at emulating gcc, I could have made it quite a bit worse:
addi $t0, $0, 0 ; i = 0
loop: slti $3, $t0, 100 ; i < 100?
beq $0, $3, eloop ; exit loop?
sll $5, $t0, 2 ; convert i to index
lw $6, awry($5) ; get awry[i]
sll $6, $6, 2 ; convert awry[i] to index
lw $7, histo($6) ; get histo[awry[i]]
addi $7, $7, 1 ; increment
sw $7, histo($6) ; put back
addi $t0, $t0, 1 ; increment i
beq $0, $0, loop ; back to top of loop
eloop:
There are 11 instructions, so it's 44 bytes.
The first instruction is only executed once, and costs one cycle.
The instructions in the loop are executed 100 times; there are ten
instructions in the loop. The two lw's each cause a stall, as
does the beq at the end. On the 101st iteration,
the slti and beq at the top are
executed; the beq at the top costs a stall.
So the total time to execute is:
| Phase | Cycles |
| --- | --- |
| Initialize | 1 |
| Loop | 100*(10 + 3) = 1300 |
| Last Pass | 3 |
| Total | 1304 |
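The tally above can be written as a small function, which makes it easy to check the arithmetic. This is a sketch under the stall assumptions stated in the problem; the function and constant names are mine, not part of the assignment.

```c
#include <assert.h>

/* Cycle count for the naive loop.
 * Assumes one cycle per instruction, a one-cycle load-use stall,
 * and a one-cycle stall for each taken branch. */
int naive_cycles(int iterations)
{
    assert(iterations >= 0);
    const int init = 1;               /* the addi that clears i */
    const int body_instructions = 10; /* slti through the closing beq */
    const int stalls_per_pass = 3;    /* two load-use stalls + taken beq at the bottom */
    const int exit_pass = 2 + 1;      /* slti + beq at the top, plus the taken-branch stall */
    return init + iterations * (body_instructions + stalls_per_pass) + exit_pass;
}
```

Calling naive_cycles(100) reproduces the 1304-cycle total.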
The key to the second version is to evaluate what a typical loop iteration will look like. I get the following:
lw $1, loc($0) ; get awry[i]
; (stall)
sll $1, $1, 2 ; convert awry[i] to index
lw $2, histo($1) ; get histo[awry[i]]
; (stall)
addi $2, $2, 1 ; increment histo[awry[i]]
sw $2, histo($1) ; store histo[awry[i]] back
where loc is the location of awry[i] and the
lines marked (stall) represent stalls. This tells us, by the way, that
the total code size is going to be 5 instructions * 4 bytes * 100 iterations = 2000 bytes.
Now, we need to see how to
interleave this iteration with the prior and following
iterations to get rid of the stalls, but without introducing
bugs in the histo increment. We can do this by
putting the lw for awry[i+1] just
after the lw for histo[awry[i]], like this:
lw $1, loc($0) ; get awry[i]
addi $2, $2, 1 ; increment histo[awry[i-1]]
sw $2, histo($3) ; store histo[awry[i-1]] back (its index is in $3; the index registers alternate between $1 and $3)
sll $1, $1, 2 ; convert awry[i] to index
lw $2, histo($1) ; get histo[awry[i]]
lw $3, loc+4($0) ; get awry[i+1]
addi $2, $2, 1 ; increment histo[awry[i]]
sw $2, histo($1) ; store histo[awry[i]] back
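The same interleaving can be sketched at the C level: the load of the next awry element is hoisted between the load and the store of the current histo entry, while the histo load, increment, and store stay in program order. This is only an illustration of the scheduling idea, not the unrolled code itself; the function name and the variables next, index, and count are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Histogram with the load of awry[i+1] issued early.
 * Only the awry load moves; the histo load/add/store for each element
 * stays in order, so repeated values are still counted correctly. */
void histogram_interleaved(const int *awry, size_t n, int *histo)
{
    assert(awry != NULL && histo != NULL);
    if (n == 0) {
        return;
    }
    int next = awry[0];                 /* startup: first element loaded ahead */
    for (size_t i = 0; i < n; i++) {
        int index = next;
        int count = histo[index];       /* lw for histo[awry[i]] */
        if (i + 1 < n) {
            next = awry[i + 1];         /* lw for awry[i+1], fills the load delay */
        }
        histo[index] = count + 1;       /* addi + sw */
    }
}
```

The guard on i + 1 corresponds to the last iteration, where there is no following element to load.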
OK, so once we're really under way, we can execute in five cycles per iteration. But can we achieve the steady state without stalls, and can we finish up at the end without stalls? For the startup, we can:
lw $1, awry($0) ; get awry[0]
lw $3, 4+awry($0) ; get awry[1]
sll $1, $1, 2 ; convert awry[0] to index
lw $2, histo($1) ; get histo[awry[0]]
sll $3, $3, 2 ; convert awry[1] to index
addi $2, $2, 1 ; increment histo[awry[0]]
sw $2, histo($1) ; store histo[awry[0]] back
lw $2, histo($3) ; get histo[awry[1]]
lw $1, 8+awry($0) ; get awry[2]
addi $2, $2, 1 ; increment histo[awry[1]]
sw $2, histo($3) ; store histo[awry[1]] back
At this point we've established the pattern we need for the steady-state condition.
The only catch is on the last iteration: there is no "next iteration" to supply a lw that fills the load delay, so we're stuck with a 1-cycle stall. So the total is 100*5 + 1 = 501 cycles.
It occurred to me after the fact that it might not have been obvious that you can assume address expressions like 4+awry. That's a reasonable assumption: I don't know whether SPIM actually allows it, but a compiler generating the code can certainly produce whatever is needed to get the same effect. If you don't assume it, you have to use another register as an index. That register has to be initialized before the first iteration (4 bytes, 1 cycle) and incremented on every iteration but the last (396 bytes, 99 cycles). In this case the total code size is 2000 + 4 + 396 = 2400 bytes and the total time is 501 + 1 + 99 = 601 cycles.
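Both sets of numbers for the fully unrolled version follow from the same bookkeeping, sketched here. The struct and function names are hypothetical, and the model assumes exactly what the text does: five instructions per iteration, one leftover stall at the very end, and optionally one extra index-register instruction per iteration.

```c
#include <assert.h>

struct unrolled_cost {
    int bytes;
    int cycles;
};

/* Size and time of the fully unrolled, interleaved code.
 * use_index_register != 0 models the variant that cannot write
 * addresses like 4+awry and keeps the awry offset in a register. */
struct unrolled_cost unrolled_cost(int iterations, int use_index_register)
{
    assert(iterations >= 1);
    struct unrolled_cost c;
    c.bytes = 5 * 4 * iterations;     /* lw, sll, lw, addi, sw at 4 bytes each */
    c.cycles = 5 * iterations + 1;    /* one unavoidable stall in the final iteration */
    if (use_index_register) {
        /* one initialization, plus one increment on every iteration but the last */
        int extra_instructions = 1 + (iterations - 1);
        c.bytes += 4 * extra_instructions;
        c.cycles += extra_instructions;
    }
    return c;
}
```

With 100 iterations this gives 2000 bytes and 501 cycles without the index register, and 2400 bytes and 601 cycles with it.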
The smallest I could think of here involved running the loop
backwards, putting the test at the bottom, and using an index
directly instead of i. So it looks like this:
addi $t0, $0, 396 ; i = 99, held as the byte offset 4*99
loop: lw $6, awry($t0) ; get awry[i]
sll $6, $6, 2 ; convert awry[i] to index
lw $7, histo($6) ; get histo[awry[i]]
addi $t0, $t0, -4 ; decrement i
addi $7, $7, 1 ; increment histo
sw $7, histo($6) ; put back
bgez $t0, loop ; back to top of loop
(I was able to speed it up a smidgeon without changing the size by moving the decrement, so I did.) This code has eight instructions, so the total size is 32 bytes. The initial addi costs one cycle; the seven instructions in the loop are executed 100 times for 700 cycles; there is one load stall per iteration (the lw of $6 followed by the sll) for another 100 cycles; and there is a branch stall on every iteration but the last for another 99 cycles. So I get 1 + 700 + 100 + 99 = 900 cycles. Notice that for a simple pipelined machine like this, minimizing code size is a good (though not perfect) heuristic for minimizing time; on a superscalar machine the pairing rules make it a much weaker heuristic.
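In C terms, the minimum-size version is a do-while loop that walks awry from the last element down, and its cycle count is a short sum. Both functions below are sketches with names of my own choosing; the C loop uses an element index where the assembly keeps the byte offset 4*i in $t0.

```c
#include <assert.h>
#include <stddef.h>

/* Histogram computed backwards with the test at the bottom of the loop.
 * Requires n >= 1, because the body runs once before the first test. */
void histogram_backward(const int *awry, int n, int *histo)
{
    assert(awry != NULL && histo != NULL && n >= 1);
    int i = n - 1;
    do {
        histo[awry[i]] += 1;
        i -= 1;
    } while (i >= 0);                 /* corresponds to bgez $t0, loop */
}

/* Cycle count for the eight-instruction version under the stated stall rules. */
int smallest_cycles(int iterations)
{
    assert(iterations >= 1);
    int init = 1;                     /* addi that sets the starting offset */
    int body = 7 * iterations;        /* seven instructions per pass */
    int load_stalls = iterations;     /* lw of awry[i] feeding the sll */
    int branch_stalls = iterations - 1; /* taken bgez on all but the last pass */
    return init + body + load_stalls + branch_stalls;
}
```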
I'm going to see how close I can come to unrolling enough to get both pipelines in use all the time, and call that a good compromise. I'm also going to write it out in two columns so it'll be easier for me to keep track of what's going down which pipeline.
First, let's just take our minimum-size single-pipeline version and send it down the two pipes so we can see what we're looking at. One thing to notice is that the figure on page 512 doesn't show any forwarding paths, but the scheduling done by the authors pretty clearly assumes at least the forwarding we've been using to date. Given that, I think I can assume a forwarding path that allows the result of an instruction in the left pipe to be written to memory by its partner in the right pipe (but you can't use it in an address calculation in the right pipe).
addi $t0, $0, 396 : nop ; i = 99;
loop: nop : lw $6, awry($t0) ; get awry[i]
sll $6, $6, 2 : nop ; convert awry[i] to index
nop : lw $7, histo($6) ; get histo[awry[i]]
addi $t0, $t0, -4 : nop ; decrement i
addi $7, $7, 1 : sw $7, histo($6) ; increment histo and put back
bgez $t0, loop : nop ; loop
Wow. Code that is pretty well optimized for a single pipeline, and makes virtually no use of the dual pipes! The size is 56 bytes, the time is 601. Still faster than the single-pipe version (it could be argued that I cheated by 99 cycles because of the psychic branch prediction, but we still end up at 700, which is still faster), but how much better can we do? Hmmmm, looking at the right pipe, it's used for three instructions and idle for three; to me, this looks like a good candidate for unrolling by a factor of two. Let's see what happens:
addi $t0, $0, 396 : nop ; i = 99;
loop: nop : lw $6, awry($t0) ; get awry[i]
sll $6, $6, 2 : lw $1, awry-4($t0) ; convert awry[i] to index; get awry[i-1]
nop : lw $7, histo($6) ; get histo[awry[i]]
sll $1, $1, 2 : nop ; convert awry[i-1] to index
addi $7, $7, 1 : sw $7, histo($6) ; increment histo and put back
addi $t0, $t0, -8 : lw $7, histo($1) ; decrement i; get histo[awry[i-1]]
nop : nop ; now that's ugly...
addi $7, $7, 1 : sw $7, histo($1) ; increment histo[awry[i-1]] and put back
bgez $t0, loop : nop ; loop
That looks quite a bit better... the code is only slightly larger at 80 bytes, but the time is reduced to 451 cycles (1 + 9*50). What happens if we unroll by four?
addi $t0, $0, 396 : nop ; i = 99;
loop: nop : lw $6, awry($t0) ; get awry[i]
sll $6, $6, 2 : lw $1, awry-4($t0) ; convert awry[i] to index; get awry[i-1]
nop : lw $7, histo($6) ; get histo[awry[i]]
sll $1, $1, 2 : lw $2, awry-8($t0) ; convert awry[i-1] to index; get awry[i-2]
addi $7, $7, 1 : sw $7, histo($6) ; increment histo and put back
sll $2, $2, 2 : lw $7, histo($1) ; convert awry[i-2] to index; get histo[awry[i-1]]
addi $t0, $t0, -16 : lw $3, awry-12($t0) ; decrement i and get awry[i-3]
addi $7, $7, 1 : sw $7, histo($1) ; increment histo[awry[i-1]] and put back
sll $3, $3, 2 : lw $7, histo($2) ; convert awry[i-3] to index; get histo[awry[i-2]]
nop : nop ; sigh
addi $7, $7, 1 : sw $7, histo($2) ; increment histo and put back
nop : lw $7, histo($3) ; get histo[awry[i-3]]
nop : nop ; sigh
addi $7, $7, 1 : sw $7, histo($3) ; put back
bgez $t0, loop : nop ; loop
A couple of notes: I deliberately put the decrement of i
on the same cycle as the last load from awry, even though I could have
put it later, as a reminder that the text changes the semantics of the MIPS a bit for the
superscalar version; the lw will see the old value of $t0,
not the new one. Also, I don't see a way around the pattern at the end
(nop lw nop nop addi sw) without a possible problem on updating
histo if two adjacent elements of awry have the same value.
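The hazard with equal adjacent elements is easy to reproduce in C. The sketch below compares two consecutive histo updates done in program order against the same updates with the second load hoisted above the first store; the function names are hypothetical and the two-element case is just the smallest one that shows the problem.

```c
#include <assert.h>
#include <stddef.h>

/* Two consecutive updates in program order: load, add, store, load, add, store. */
void update_pair_in_order(int *histo, int a, int b)
{
    assert(histo != NULL);
    int count_a = histo[a];
    histo[a] = count_a + 1;
    int count_b = histo[b];           /* sees the store above when a == b */
    histo[b] = count_b + 1;
}

/* The same two updates with the second load moved above the first store,
 * which is what filling the nop slots at the end of the loop would require. */
void update_pair_hoisted(int *histo, int a, int b)
{
    assert(histo != NULL);
    int count_a = histo[a];
    int count_b = histo[b];           /* stale when a == b */
    histo[a] = count_a + 1;
    histo[b] = count_b + 1;           /* overwrites the first store, losing an increment */
}
```

When a and b differ the two orders agree; when they are equal the hoisted version counts the value once instead of twice.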
If I wanted to I could unroll by eight or sixteen; I'd have to devote some more thought
to whether this would actually end up using the pipelines more efficiently. At any rate,
since 100 is divisible by four but not by eight, this would require some cleanup code to
take care of the last few iterations, so unrolling by four seems to be a good compromise.
So I get a size of 128 bytes (16 issue rows of 8 bytes each) and a time of 376 cycles (1 + 15*25).
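The size and time figures for the three two-pipe versions all come from counting issue rows, which can be sketched as below. The names are mine, and the model makes the same simplifications as the counts in the text: every row takes one cycle, each row occupies two 4-byte slots (nops included), and the loop branch costs nothing extra.

```c
#include <assert.h>

struct issue_cost {
    int bytes;
    int cycles;
};

/* rows_total: issue rows in the listing, including the single setup row.
 * unroll: source iterations handled per trip through the loop.
 * iterations: total source iterations (100 in the problem). */
struct issue_cost dual_issue_cost(int rows_total, int unroll, int iterations)
{
    assert(rows_total >= 2 && unroll >= 1 && iterations % unroll == 0);
    struct issue_cost c;
    c.bytes = rows_total * 2 * 4;     /* two 4-byte instructions per row */
    c.cycles = 1 + (rows_total - 1) * (iterations / unroll);
    return c;
}
```

For 100 iterations, 7 rows with no unrolling give 56 bytes and 601 cycles, 10 rows unrolled by two give 80 bytes and 451 cycles, and 16 rows unrolled by four give 128 bytes and 376 cycles.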
You can use SPIM to debug your code; to do that, you'll need to put in
an extra loop to initialize awry with values.
Unfortunately, SPIM doesn't do instruction cycle counting, so you
can't use it to get the cycle counts.