CS 473 - CDC 6600

Classic 3-operand, load-store architecture (with, admittedly, some really weird load and store instructions!). Scoreboarded, but not pipelined.

In 1970, J. E. Thornton (who, according to Seymour Cray, was responsible for "most of the detailed design of the Control Data model 6600") wrote a book called Design of a Computer: The Control Data 6600, which has been the standard reference on the CDC 6600 ever since. The book is long out of print, but Mr. Thornton gave Tom Uban (about whom I know nothing, except that he pursued the project) permission to scan it and make it available on the net. There is a copy at http://www.cs.nmsu.edu/~pfeiffer/classes/473/notes/DesignOfAComputer_CDC6600.pdf which you can download if you want the full details on this machine.

A couple of warnings: it's an 8 megabyte download so it'll take a while unless you've got broadband, and it's much too big for you to print copies on the department computers.

Registers

Ten functional units

Unit           Time   Instructions
Boolean        3
Fixed Add      3      Full-word integer add, subtract
FP Add         4
Shift          3-4    Shift, normalize, pack, unpack, mask
Multiply (2)   10
Divide         29
Increment (2)  3      Address arithmetic
Branch

None of these units are pipelined, but all can run in parallel. The machine is superscalar in the sense that several instructions can be started simultaneously; however, actual instruction issue is one at a time (we'll see in just a moment that the machine separates instruction issue from instruction start).

Why Out-of-Order Execution?

One of the important features of this machine (and machines like Intel's Pentium Pro and later) is that instructions can be executed in an order other than that specified by the program.

First, it takes a bit of thought to convince yourself that out-of-order execution can be guaranteed correct. It can; the basic notion is that the originally specified execution order is used to establish the dependences and antidependences between the instructions. Once that's been done, any order that satisfies those dependences and antidependences will give the same result as the original order.
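The claim can be checked on a small case. What follows is a minimal Python sketch, not anything taken from the CDC itself: the three-instruction program, the starting register values, and the helper names (conflicts, constraints, respects_order, run) are all made up for illustration. It derives the ordering constraints from the original program order, then confirms that a reordering which satisfies them leaves the registers in the same final state, while one that violates them does not.

```python
import operator

# Hypothetical program: (dest, op, src1, src2), listed in original program order.
PROGRAM = [
    ("X6", operator.truediv, "X1", "X2"),   # 0
    ("X7", operator.add,     "X5", "X6"),   # 1: reads X6, which 0 writes
    ("X3", operator.mul,     "X2", "X4"),   # 2: independent of 0 and 1
]

def conflicts(early, late):
    """True if `late` must stay after `early`."""
    e_dst, _, e_s1, e_s2 = early
    l_dst, _, l_s1, l_s2 = late
    raw = e_dst in (l_s1, l_s2)   # dependence: late reads what early writes
    war = l_dst in (e_s1, e_s2)   # antidependence: late overwrites early's input
    waw = e_dst == l_dst          # both write the same register
    return raw or war or waw

def constraints(program):
    """All (i, j) index pairs where instruction i must precede instruction j."""
    return {(i, j)
            for i in range(len(program))
            for j in range(i + 1, len(program))
            if conflicts(program[i], program[j])}

def respects_order(order, program):
    """Does the execution order (a list of indices) satisfy every constraint?"""
    pos = {idx: k for k, idx in enumerate(order)}
    return all(pos[i] < pos[j] for i, j in constraints(program))

def run(order, program, regs):
    """Execute instructions in the given order on a copy of the registers."""
    regs = dict(regs)
    for idx in order:
        dst, op, s1, s2 = program[idx]
        regs[dst] = op(regs[s1], regs[s2])
    return regs

# Made-up starting values.
START = {"X1": 8.0, "X2": 2.0, "X3": 0.0, "X4": 3.0,
         "X5": 1.0, "X6": 0.0, "X7": 0.0}

in_order = run([0, 1, 2], PROGRAM, START)
reordered = run([2, 0, 1], PROGRAM, START)   # legal: 0 still precedes 1
illegal = run([1, 0, 2], PROGRAM, START)     # illegal: 1 reads X6 too early
```

Here the only constraint is that instruction 0 precedes instruction 1, so moving instruction 2 to the front is harmless, while swapping 0 and 1 changes the value that ends up in X7.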

As we will see here, out-of-order execution introduces significantly more bookkeeping to the process of executing code! The achievements of CDC (with the 6600) and IBM (with the 360/91, using Tomasulo's Algorithm) would be interesting today; considering that they were accomplished with a technology level where individual transistors were packaged in little metal cans, they are little short of amazing.

All the same, they still failed to strictly maintain the goal of emulating a purely serial computer. In IBM's case, the normal idea of an interrupt was replaced with the idea of an "approximate" interrupt: when an interrupt occurred, the saved PC was only "approximately" that of the instruction that caused the interrupt. It was entirely possible that several instructions following the instruction causing a trap might be executed before the trap happened. Hard to imagine trying to debug a program that is getting floating point overflows in conditions like that! In CDC's case, they "punted" the idea of interrupts completely; it was possibly the last large computer to not use interrupts. Intel has managed to do precise interrupts in their designs; the key has been to add a whole new pipeline stage that reorders the instructions at the end.

But does instruction reordering really buy us anything? Could we do as well with a non-reordering implementation, if we had sufficiently detailed information on the processor? In the case of straight-line code (code with no branches), yes. It would be possible to place the code in the same order that the processor would end up executing it, which would result in execution just as fast as the out-of-order machine. If branches are present, I doubt it; the optimal order for instructions would depend on the instructions already waiting to be executed, so it would likely be different depending on the path taken to a given point in the code. Though with some loop unrolling, it would be possible to get awfully, awfully close.

Even if that weren't the case, the "detailed knowledge" requirement would be awfully restrictive. First, of course, the manufacturers don't want us to have that much knowledge regarding what's in their pipelines; they're trying to keep that secret from each other. Second, letting the hardware do the reordering might reduce the number of different custom-optimized compiler backends necessary for different processors sharing an instruction set. That makes it worth it.

Distributed Control

A distributed control algorithm is used. A central "scoreboard" keeps track of reservations among the registers and functional units, and communicates with them as needed. There are four important signals used to communicate:

Issue
    Sent from scoreboard to functional unit when instruction is issued.
Go
    Sent from scoreboard to functional unit when all operands are available and instruction is ready to go. This can be sent on the same cycle as Issue.
Request Release
    Sent from functional unit to scoreboard when instruction is finished and results are ready. The time required by a functional unit to perform an instruction is measured from "Go" to "Request Release."
Release
    Sent from scoreboard to functional unit when results can be released to registers (and, simultaneously, to any waiting instructions). Another functional unit can use the results (and get an Issue) on the same cycle they are written. The functional unit can be reserved for another instruction on the following cycle. This can be sent on the same cycle as Request Release.
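The handshake can be pictured as a small state machine. This is a rough Python sketch under stated assumptions: the class FunctionalUnit, its state names ("idle", "issued", "running", "finished"), and the use of an integer cycle counter are all inventions for illustration. The real units are sequenced by dedicated timing hardware, not by a counter like this.

```python
class FunctionalUnit:
    """Toy model of one unit; `latency` counts cycles from Go to Request Release."""

    def __init__(self, name, latency):
        self.name = name
        self.latency = latency
        self.state = "idle"
        self.go_cycle = None

    def issue(self):
        # Issue: the scoreboard reserves this unit for an instruction.
        assert self.state == "idle", "unit already reserved"
        self.state = "issued"

    def go(self, cycle):
        # Go: all operands are available, so execution starts.
        assert self.state == "issued"
        self.state = "running"
        self.go_cycle = cycle

    def request_release(self, cycle):
        """Return True once the unit has finished and wants to write its result."""
        if self.state == "running" and cycle - self.go_cycle >= self.latency:
            self.state = "finished"
        return self.state == "finished"

    def release(self):
        # Release: the result may be written; the unit can then be reserved again.
        assert self.state == "finished"
        self.state = "idle"


fp_add = FunctionalUnit("FP Add", latency=4)
fp_add.issue()
fp_add.go(cycle=0)   # Issue and Go on the same cycle: operands were ready
trace = [(c, fp_add.request_release(c)) for c in range(6)]
fp_add.release()
```

With a latency of 4 and Go at cycle 0, the unit first asks for release at cycle 4, and keeps asking until the scoreboard grants it.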

The functional units do not use a conventional clock, nor microcode. Instead, there is a series of timers which send signals to the internals of the functional units as needed. This series of timers is called the "timing chain" (demonstrating that somebody at CDC had a sense of humor -- a timing chain is the part of an automobile engine that synchronizes the opening and closing of the intake and exhaust valves, and the firing of the ignition spark, with the turning of the crankshaft).

Here's a figure illustrating a generic CDC functional unit, showing the timing chain and the sequence of signals going back and forth between the functional unit and the central control.

[Figure: CDC functional unit]

Scoreboard

The scoreboard has an entry for each functional unit, and for each register.

Functional unit entries mark whether there is an instruction pending for the functional unit (and what it is), which registers are reserved by each functional unit for input, and whether the data is available in them, like this:

Unit           Res   Op1   R1   Op2   R2
Boolean
Fixed Add
FP Add
Shift
Multiply (2)
Divide
Increment (2)
Branch

Register entries tell which functional unit, if any, has a register reserved for output, like this:

     A    B    X
0
1
2
3
4
5
6
7
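As a data structure, the two tables might look something like the following Python sketch. The column meanings are my reading of the headings (Res as the result register, Op1/Op2 as the source registers, R1/R2 as "data available" flags), and the names UnitEntry, Scoreboard, reserve, and "Multiply 1"/"Multiply 2" are hypothetical; nothing here is the actual hardware layout.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnitEntry:
    op: Optional[str] = None       # pending instruction; None if the unit is free
    result: Optional[str] = None   # Res: register reserved for output (assumed meaning)
    src1: Optional[str] = None     # Op1
    ready1: bool = False           # R1: is Op1's data available?
    src2: Optional[str] = None     # Op2
    ready2: bool = False           # R2: is Op2's data available?

# The duplicated units get numbered names here; that naming is an assumption.
UNITS = ["Boolean", "Fixed Add", "FP Add", "Shift", "Multiply 1", "Multiply 2",
         "Divide", "Increment 1", "Increment 2", "Branch"]

class Scoreboard:
    def __init__(self):
        self.units = {name: UnitEntry() for name in UNITS}
        # One slot per A, B, and X register: which unit will write it, if any.
        self.writer = {f"{bank}{n}": None for bank in "ABX" for n in range(8)}

    def reserve(self, unit, op, result, src1, src2):
        """Enter reservations; refuse if the unit or result register is taken."""
        if self.units[unit].op is not None or self.writer[result] is not None:
            return False
        # A source is ready only if no unit currently has it reserved for output.
        self.units[unit] = UnitEntry(op, result,
                                     src1, self.writer[src1] is None,
                                     src2, self.writer[src2] is None)
        self.writer[result] = unit
        return True

sb = Scoreboard()
first = sb.reserve("Divide", "X6 <- X1 / X2", "X6", "X1", "X2")
second = sb.reserve("FP Add", "X7 <- X5 + X6", "X7", "X5", "X6")
```

After these two calls both reservations are entered, but the FP Add entry has its second operand marked not ready, because X6 is still reserved for output by the divide unit.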

Steps in instruction execution:

  1. check availability of functional unit and result register. If either is already reserved, the instruction must wait for it to become available, and new reservations are stalled. (This is called a "first order conflict"; today, we'd call it a structural or WAW hazard, depending on where the conflict occurred.)

    (functional unit conflict)
    X6 <- X1 + X2
    X5 <- X3 + X4

    (result register conflict)
    X6 <- X1 + X2
    X6 <- X4 * X5

    In both cases, the second instruction can be issued on the cycle following completion of the first instruction:

    [Figure: First Order Conflict]

  2. enter reservations for functional unit and result register. If one or both source registers are reserved, the instruction cannot start yet, but the machine can keep entering reservations for later instructions ("second order" conflict -- RAW hazard).

    X6 <- X1 / X2
    X7 <- X5 + X6 (conflict on this instruction)
    X3 <- X2 * X4 (this instruction free to execute)

    The CDC also supports forwarding, so data can be written to the second instruction (and it can go) on the same cycle that the first instruction finishes. I'm drawing this as a really stretched first cycle on the addition instruction, and using an arrow to show forwarding.

    [Figure: Second Order Conflict]

  3. when source registers contain valid data, read the data and start the instruction in the functional unit (the "Go" signal). The functional unit now executes the instruction under local control.
  4. when the functional unit has completed the instruction, it checks to see if it can write its output to its result register (this is impossible if the register is reserved as a source by another functional unit, and that functional unit already has it marked as available -- "third order" conflict -- WAR hazard).

    X3 <- X1 / X2
    X5 <- X4 * X3
    X4 <- X0 + X6

    The third instruction is able to write its results back on the cycle after the second instruction reads its operands. I'm drawing this as a stretched last cycle on the addition. Pay close attention to the relative timings of the last cycle of the first instruction, the end of the first cycle of the second instruction, and the end of the last cycle of the third instruction.

    [Figure: Third Order Conflict]
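The three kinds of conflict in the steps above can be told apart mechanically from the unit and register names. The Python sketch below is only an illustration: the function classify and the tuple layout are made up, the choice of "FP Add" for the additions is an assumption (the examples don't say which adder is used), and the function reports potential conflicts only. Whether a stall actually happens depends on timing, and a unit conflict on a duplicated unit such as Multiply could be avoided by using the second copy.

```python
def classify(earlier, later):
    """Each instruction is (unit, dest, src1, src2). Returns conflict names."""
    e_unit, e_dst, e_s1, e_s2 = earlier
    l_unit, l_dst, l_s1, l_s2 = later
    found = []
    if e_unit == l_unit or e_dst == l_dst:
        found.append("first order")    # structural or WAW: same unit or same result register
    if e_dst in (l_s1, l_s2):
        found.append("second order")   # RAW: later must wait for an operand
    if l_dst in (e_s1, e_s2):
        found.append("third order")    # WAR: later may have to delay its write
    return found

# The examples from the steps above, with assumed unit assignments.
unit_conflict = classify(("FP Add", "X6", "X1", "X2"),
                         ("FP Add", "X5", "X3", "X4"))
result_conflict = classify(("FP Add", "X6", "X1", "X2"),
                           ("Multiply", "X6", "X4", "X5"))
raw_conflict = classify(("Divide", "X6", "X1", "X2"),
                        ("FP Add", "X7", "X5", "X6"))
war_conflict = classify(("Multiply", "X5", "X4", "X3"),
                        ("FP Add", "X4", "X0", "X6"))
```

The first two pairs come out as first order conflicts, the divide/add pair as second order, and the multiply/add pair as third order, matching the labels in the text.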

Example

Here's an example of how a CDC would compute a variance:

(1 / (N-1)) * (N * sum(x^2) - sum(x) * sum(x))

Suppose we have the following assignment of variables to registers:
Variable   Register
1          X0
N          X1
sum x^2    X2
sum x      X3

Then the code to calculate the variance looks and executes like this:

[Figure: CDC Variance Code]

While this code doesn't end up making very good use of the machine's parallelism, a longer example would show more happening during the divide. As it is, a purely serial implementation without forwarding would have taken 57 cycles, while the real CDC only took 42.
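The 57-cycle serial figure can be reproduced from the unit times given earlier. The Python sketch below assumes an instruction mix, since the actual listing is a figure: one floating point subtract for N-1, two multiplies, one floating point subtract for the difference, and one divide. That mix is my inference from the formula and the total, not a transcription of the code.

```python
# Unit times from the functional unit table.
LATENCY = {"FP Add": 4, "Multiply": 10, "Divide": 29}

# Assumed instruction mix (inferred, not taken from the original listing).
ASSUMED_OPS = [
    ("N - 1",                  "FP Add"),
    ("N * sum(x^2)",           "Multiply"),
    ("sum(x) * sum(x)",        "Multiply"),
    ("difference of products", "FP Add"),
    ("final divide",           "Divide"),
]

# With no overlap at all, the total is just the sum of the unit times.
serial_cycles = sum(LATENCY[unit] for _, unit in ASSUMED_OPS)

# Ratio of the serial count to the 42 cycles quoted for the real machine.
speedup = serial_cycles / 42
```

Under this assumption the sum is 4 + 10 + 10 + 4 + 29 = 57, and the quoted 42 cycles corresponds to a speedup of a little over 1.35.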


Last modified: Wed Mar 15 11:16:12 MST 2006