X1 <- X2 / X3
X4 <- X1 * X0
X0 <- X3 + X5
X6 <- X2 * X3
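If you want to check the register dependencies among statements like these mechanically, here is a minimal Python sketch. The function names (`parse`, `dependencies`) and the tuple format are my own choices for illustration, and the statements are numbered 1-4 in the order given. It just compares every earlier statement against every later one and reports true dependencies (RAW), antidependencies (WAR), and output dependencies (WAW).

```python
import re

def parse(stmt):
    """Split 'X1 <- X2 / X3' into (destination, [sources])."""
    dest, expr = (part.strip() for part in stmt.split("<-"))
    return dest, re.findall(r"X\d+", expr)

def dependencies(stmts):
    """Return (kind, earlier, later, register) for each dependent pair of statements."""
    parsed = [parse(s) for s in stmts]
    found = []
    for j, (dest_j, srcs_j) in enumerate(parsed):
        for i in range(j):
            dest_i, srcs_i = parsed[i]
            if dest_i in srcs_j:
                found.append(("RAW", i + 1, j + 1, dest_i))  # later reads what earlier wrote
            if dest_j in srcs_i:
                found.append(("WAR", i + 1, j + 1, dest_j))  # later overwrites what earlier read
            if dest_i == dest_j:
                found.append(("WAW", i + 1, j + 1, dest_i))  # both write the same register
    return found

STATEMENTS = [
    "X1 <- X2 / X3",
    "X4 <- X1 * X0",
    "X0 <- X3 + X5",
    "X6 <- X2 * X3",
]

for dep in dependencies(STATEMENTS):
    print(dep)
# prints:
# ('RAW', 1, 2, 'X1')
# ('WAR', 2, 3, 'X0')
```

So statement 2 has to wait for statement 1's result, statement 3 must not overwrite X0 before statement 2 has read it, and statement 4 is independent of the rest.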
Here's some code that purports to be MIPS code for the C program in HW4 (since it only "purports to be," you don't need to worry about whether I've got a bug in it!), except that it should only go through the loop body twice:
        lui   $2, hi(awry)
        ori   $2, $2, lo(awry)    ; pointer to awry
        addi  $3, $2, 8           ; end of awry
        lui   $4, hi(histo)
        ori   $4, $4, lo(histo)   ; pointer to histo
loop:   lw    $5, 0($2)           ; read awry[i]
        sll   $5, $5, 2           ; convert to index
        add   $5, $4, $5          ; get pointer to location in histo
        lw    $6, 0($5)           ; get old value of histo
        addiu $6, $6, 1           ; increment histo
        sw    $6, 0($5)           ; store back
        addiu $2, $2, 4           ; increment awry pointer
        bne   $2, $3, loop        ; if not done, do it again
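For reference, here is a rough Python sketch of what the loop above appears to compute, going by its comments: it bumps one histogram bin for each of the first two entries of awry. The function name `histogram_pass` and the sample data are made up for illustration; they are not taken from HW4.

```python
def histogram_pass(awry, histo, count=2):
    """Increment histo[awry[i]] for the first `count` entries of awry."""
    for i in range(count):   # $2 walks awry; $3 marks the end (8 bytes = 2 words)
        index = awry[i]      # lw $5, 0($2)
        histo[index] += 1    # sll/add form the address; lw, addiu, sw update the bin
    return histo

# Hypothetical data: both entries of awry select bin 3.
print(histogram_pass([3, 3], [0, 0, 0, 0]))  # prints [0, 0, 0, 2]
```

Note that the MIPS version tests at the bottom of the loop, so it always runs the body at least once; with two entries the two forms behave the same.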
Now, suppose you have a CPU executing MIPS code under the following assumptions:
lw instructions take five cycles, and
sw instructions take four cycles.
In a moment, I'll be asking you some questions about this machine. But first, here's an example showing the execution of the first few instructions.
On the first cycle, instructions 1-4 are fetched into IF/ID.
Instructions 1 and 4 have no dependencies on other instructions, so they start down the pipe. This frees space for two more instructions, so 5 and 6 are fetched.
Instruction 2 (which was held for a cycle) will have had its dependency satisfied by the time it reaches the execute stage, so it starts down the pipe on cycle 3. Instruction 5 can also start down the pipe (it didn't actually have to wait, since it was just fetched last cycle).
Since a second pair of instructions has started, there is room to fetch another pair. I'm not going to show that; it's where the homework starts for you.
As the instructions move on to retirement, notice that 1-5 all take four cycles (though they may be delayed by stalls), and 6 takes five.
One little thing: it would be easy to get a misimpression from this that you can always fetch two new instructions per cycle. You can't; the number you can fetch is determined by the number that go down the pipes. So if only one went down a pipe you could only fetch one; if four went down together you could fetch four.
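The fetch rule can be sketched in a few lines of Python. This is only a toy model under my own assumptions: IF/ID holds four instructions, the instruction stream never runs dry, and slots freed by instructions starting down the pipes are refilled in the same cycle. The name `fetch_counts` is just for illustration.

```python
def fetch_counts(issued_per_cycle, capacity=4):
    """Return how many instructions are fetched in each cycle.

    issued_per_cycle[c] is the number of instructions that leave IF/ID
    and start down a pipe in cycle c + 1.
    """
    occupancy = 0
    fetched = []
    for issued in issued_per_cycle:
        if issued > occupancy:
            raise ValueError("cannot issue more instructions than IF/ID holds")
        occupancy -= issued            # issued instructions free their slots
        refill = capacity - occupancy  # fetch only into the freed slots
        fetched.append(refill)
        occupancy += refill
    return fetched

# The example above: nothing issues in cycle 1, then two instructions per cycle.
print(fetch_counts([0, 2, 2]))  # prints [4, 2, 2]
# One instruction issues, then four together.
print(fetch_counts([0, 1, 4]))  # prints [4, 1, 4]
```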
Now then, on to the questions:
The key is to ask how many instructions can be in the relevant cycle at a time. Since the IF/ID size sets a limit of four, this means it can't need more than four arithmetic pipes, four memory pipes, eight register reads, four register writes, and four memory accesses. (This turned out to be a bit ambiguous; I was counting data memory separately from instruction memory, so you could have gotten eight memory reads and four writes by counting instruction reads too.)
There are never more than two instructions in a single stage of the arithmetic pipe, nor more than one in a single stage of the memory pipe. So that's all that was really needed.
Just eyeballing it, it looks like the main constraint is from instruction dependencies. But the right way to answer the question is to just issue all the instructions on the first cycle, and see how long it ends up taking:
(I've drawn an antidependency with a dashed arrow on this one.) Sure enough, without assuming register renaming, all that extra fetching doesn't do much for us. If we'd done some loop unrolling, though, it would have made a much bigger difference (since we would have gotten rid of the antidependency in the process).