Classic 3-operand, load-store architecture (with, admittedly, some really weird load and store instructions!). Scoreboarded, but not pipelined.
In 1970, J. E. Thornton (according to Seymour Cray, Mr. Thornton was responsible for "most of the detailed design of the Control Data model 6600") wrote a book called Design of a Computer: The Control Data 6600, which has been the standard reference on the CDC ever since. The book is long out of print; Mr. Thornton gave Tom Uban (about whom I know nothing, except that he pursued the project) permission to scan it and make it available on the net. There is a copy at http://www.cs.nmsu.edu/~pfeiffer/classes/473/notes/DesignOfAComputer_CDC6600.pdf which you can download if you want the full details on this machine.
A couple of warnings: it's an 8 megabyte download so it'll take a while unless you've got broadband, and it's much too big for you to print copies on the department computers.
|Unit||Time (cycles)||Operations|
|Fixed Add||3||Full-word integer add, subtract|
|Shift||3-4||Shift, normalize, pack, unpack, mask|
None of these units are pipelined, but all can run in parallel. The machine is superscalar in the sense that several instructions can be started simultaneously; however, actual instruction issue is one at a time (we'll see in just a moment that the machine separates instruction issue from instruction start).
One of the important features of this machine (and machines like Intel's Pentium Pro and later) is that instructions can be executed in an order other than that specified by the program.
First, it takes a bit of thought to convince yourself that out-of-order execution can be guaranteed correct. It can; the basic notion is that the originally specified execution order is used to establish the dependences and antidependences between the instructions. Once that's been done, any order that satisfies those dependencies will give the same result as the original order.
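To make that concrete, here's a toy illustration in plain Python (my own invented representation, nothing CDC-specific): build the dependence edges implied by program order, then check that a reordered schedule respecting them leaves the registers in exactly the same final state.

```python
# A small sketch: build the dependences implied by program order, then
# check that a reordering that respects them gives the same final
# register values as the original order.

def deps(prog):
    """Edges (i, j), i < j, meaning instruction j must follow i:
    RAW (j reads i's result), WAR (j overwrites a register i reads),
    or WAW (i and j write the same register)."""
    edges = set()
    for i, (di, si) in enumerate(prog):
        for j in range(i + 1, len(prog)):
            dj, sj = prog[j]
            if di in sj or dj in si or dj == di:
                edges.add((i, j))
    return edges

def run(prog, order, regs):
    regs = dict(regs)
    for k in order:
        dest, srcs = prog[k]
        regs[dest] = sum(regs[s] for s in srcs)   # model every op as an add
    return regs

# (destination, source registers); instruction 2 is independent of 0 and 1
prog = [("X6", ("X1", "X2")),
        ("X7", ("X5", "X6")),
        ("X3", ("X2", "X4"))]
regs = {f"X{i}": i for i in range(8)}

reordered = [2, 0, 1]
# the reordering respects every dependence edge ...
assert all(reordered.index(i) < reordered.index(j) for i, j in deps(prog))
# ... so it produces exactly the same final state as program order
assert run(prog, [0, 1, 2], regs) == run(prog, reordered, regs)
```

The only edge here is the RAW dependence from the first instruction to the second (X6); the third instruction touches neither X6 nor X7, so it can legally run first.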
As we will see here, out-of-order execution introduces significantly more bookkeeping to the process of executing code! The achievements of CDC (with the 6600) and IBM (with the 360/91, using Tomasulo's Algorithm) would be interesting today; considering that they were accomplished with a technology level where individual transistors were packaged in little metal cans, they are little short of amazing.
All the same, they still failed to strictly maintain the goal of emulating a purely serial computer. In IBM's case, the normal idea of an interrupt was replaced with the idea of an "approximate" interrupt: when an interrupt occurred, the saved PC was only "approximately" that of the instruction that caused the interrupt. It was entirely possible that several instructions following the instruction causing a trap might be executed before the trap happened. Hard to imagine trying to debug a program that is getting floating point overflows in conditions like that! In CDC's case, they "punted" on the idea of interrupts completely; the 6600 was possibly the last large computer not to use interrupts. Intel has managed to do precise interrupts in their designs; the key has been to add a whole new pipeline stage at the end that puts the instructions back into program order before their results become visible.
But does instruction reordering really buy us anything? Could we do as well with a non-reordering implementation, if we had sufficiently detailed information on the processor? In the case of straight-line code (code with no branches), yes. It would be possible to place the code in the same order that the processor would end up executing it, which would result in execution just as fast as the out-of-order machine. If branches are present, I doubt it; the optimal order for instructions would depend on the instructions already waiting to be executed, so it would likely be different depending on the path taken to a given point in the code. Though with some loop unrolling, it would be possible to get awfully, awfully close.
Even if that weren't the case, the "detailed knowledge" requirement would be awfully restrictive. First, of course, the manufacturers don't want us to have that much knowledge regarding what's in their pipelines; they're trying to keep that secret from each other. Second, scheduling in software would require a different custom-optimized compiler backend for every processor sharing an instruction set; reordering in hardware lets a single backend serve them all. That alone makes it worth it.
A distributed control algorithm is used. A central "scoreboard" keeps track of reservations among the registers and functional units, and communicates with them as needed. There are four important signals used to communicate:
The functional units do not use a conventional clock, nor microcode. Instead, there is a series of timers which send signals to the internals of the functional units as needed. This series of timers is called the "timing chain" (demonstrating that somebody at CDC had a sense of humor -- a timing chain is the part of an automobile engine that synchronizes the opening and closing of the intake and exhaust valves, and the firing of the ignition spark, with the turning of the crankshaft).
Here's a figure illustrating a generic CDC function unit, showing the timing chain and the sequence of signals going back and forth between the function unit and the central control.
The scoreboard has an entry for each functional unit, and one for each register.
Functional unit entries mark whether there is an instruction pending for the functional unit (and what it is), which registers are reserved by each functional unit for input, and whether the data is available in them, like this:
Register entries tell which functional unit, if any, has a register reserved for output, like this:
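The two kinds of entries can be sketched as Python dictionaries. The field names here are my own (borrowed from the usual textbook treatment of scoreboarding), not Thornton's:

```python
# A sketch of the scoreboard's two kinds of entries (hypothetical
# field names, following the common textbook presentation).

# One entry per functional unit: is an instruction pending (and what
# is it), which registers are reserved for its inputs, and is the
# data in each of them available yet?
fu_status = {
    "add": {
        "busy": True,            # instruction pending?
        "op": "+",
        "dest": "X6",            # register reserved for output
        "src": ["X1", "X2"],     # registers reserved for input
        "ready": [True, True],   # operand value available yet?
    },
    "multiply": {"busy": False},
}

# One entry per register: which functional unit, if any, has this
# register reserved for output.
reg_status = {f"X{i}": None for i in range(8)}
reg_status["X6"] = "add"         # the add unit will write X6
```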
(functional unit conflict)
X6 <- X1 + X2
X5 <- X3 + X4
(result register conflict)
X6 <- X1 + X2
X6 <- X4 * X5
In both cases, the second instruction can be issued on the cycle following completion of the first instruction:
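Both kinds of conflict reduce to a single test at issue time; here's a sketch in Python (a simplification with my own invented layout, not the 6600's actual logic): the instruction may issue only if its functional unit is free (no functional unit conflict) and no unit already has its result register reserved (no result register conflict).

```python
# Sketch of the issue-time check (hypothetical names): stall on a
# functional unit conflict or a result register conflict.

def can_issue(unit, dest, fu_busy, reg_owner):
    structural = fu_busy[unit]               # unit already working?
    result_conflict = reg_owner.get(dest) is not None  # dest reserved?
    return not structural and not result_conflict

fu_busy = {"add": True, "multiply": False}
reg_owner = {"X6": "add"}                    # the add unit will write X6

# X5 <- X3 + X4 while X6 <- X1 + X2 occupies the adder: stalls
assert not can_issue("add", "X5", fu_busy, reg_owner)
# X6 <- X4 * X5 while the adder still owns X6: stalls too
assert not can_issue("multiply", "X6", fu_busy, reg_owner)
# an unrelated multiply writing X7 issues immediately
assert can_issue("multiply", "X7", fu_busy, reg_owner)
```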
X6 <- X1 / X2
X7 <- X5 + X6 (conflict on this instruction)
X3 <- X2 * X4 (this instruction free to execute)
The CDC also supports forwarding, so data can be written to the second instruction (and it can go) on the same cycle that the first instruction finishes. I'm drawing this as a really stretched first cycle on the addition instruction, and using an arrow to show forwarding.
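The read-operands check behind this example can be sketched the same way (again with my own simplified layout): an issued instruction may read an operand only when no functional unit still has that register reserved for output.

```python
# Sketch of the read-after-write check (hypothetical names): an issued
# instruction reads its operands only once no unit owns any of them.

def can_read(srcs, reg_owner):
    return all(reg_owner.get(r) is None for r in srcs)

reg_owner = {"X6": "divide"}      # the divide will write X6

assert not can_read(["X5", "X6"], reg_owner)  # the add must wait on X6
assert can_read(["X2", "X4"], reg_owner)      # the multiply goes ahead

# With forwarding, X6 becomes readable on the divide's final cycle:
reg_owner["X6"] = None
assert can_read(["X5", "X6"], reg_owner)
```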
X3 <- X1 / X2
X5 <- X4 * X3
X4 <- X0 + X6
The third instruction is able to write its results back on the cycle after the second instruction reads its operands. I'm drawing this as a stretched last cycle on the multiplication. Pay close attention to the relative timings of the last cycle of the first instruction, the end of the first cycle of the second instruction, and the end of the last cycle of the third instruction.
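The check governing this example (write-after-read) can also be sketched in a few lines (a simplification, with invented names): a finished instruction may store its result only when no earlier, still-waiting instruction has yet to read the old value of that register.

```python
# Sketch of the write-after-read check (hypothetical names): hold a
# finished result until every earlier instruction that needs the old
# value of the destination register has read its operands.

def can_store(dest, waiting_reads):
    """waiting_reads: source registers of issued instructions that
    have not yet read their operands."""
    return dest not in waiting_reads

# X5 <- X4 * X3 has issued but still waits on X3 (the divide), so it
# has not yet read X4 ...
waiting_reads = {"X4", "X3"}
# ... and X4 <- X0 + X6, though finished, must hold its result:
assert not can_store("X4", waiting_reads)

# Once the multiply reads its operands, the add may store:
assert can_store("X4", set())
```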
Here's an example of how a CDC would compute a variance:
(1 / (N * (N-1))) * (N * sum(x^2) - sum(x) * sum(x))
Suppose we have the following assignment of variables to registers:
Then the code to calculate the variance looks and executes like this:
While this code doesn't end up making very good use of the machine's parallelism, a longer example would show more happening during the divide. As it is, a purely serial implementation without forwarding would have taken 57 cycles, while the real CDC only took 42.
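The arithmetic itself (not the 6600's scheduling) is easy to check in Python; this computes the sample variance by the standard one-pass rearrangement s² = (N·Σx² − (Σx)²) / (N·(N−1)) and compares it against the direct definition:

```python
# Check the rearranged variance formula against the direct definition.
# (Illustration only -- nothing here models the 6600's timing.)

def variance_direct(xs):
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_rearranged(xs):
    # One pass: accumulate sum(x) and sum(x^2), then combine.
    n = len(xs)
    s = sq = 0.0
    for x in xs:
        s += x
        sq += x * x
    return (n * sq - s * s) / (n * (n - 1))

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
assert abs(variance_direct(xs) - variance_rearranged(xs)) < 1e-9
```

The one-pass form is what makes the register-level code above possible: both sums accumulate in a single sweep over the data, leaving the multiply and divide units free to overlap the final combination.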