CS 473 - Intel Information

Paper from Intel on Pentium, Pentium MMX, and Pentium II Pipeline (PDF format)

Agner Fog's excellent information on optimizing for various Pentium processors

Toy "benchmark" program used to illustrate pipeline anomaly later:

Notes on benchmarks, including timing results for the toy code on a number of computers.

Intel IA32 History

IA32 is Intel's name for its 32-bit instruction set architecture, introduced with the Intel 386 processor and substantially enhanced since. Some milestones in Intel history are in the following table. It mentions only major new processor cores, not smaller enhancements such as the MMX instructions, and is excerpted from Intel's Processor Hall of Fame with other information from various places.

Year  Processor          Importance
1972  8008               First true microcomputer (I don't count the 4004, though many other people do). Developed as an embedded controller for a computer terminal for Computer Terminal Corporation. However, due to CTC's production timing, the chip wasn't actually used by them; their Datapoint 2200 terminal used about 100 separate components instead of the new microprocessor chip.
1974  8080               Slightly "tweaked" improvement on 8008
1978  8086               Expansion to 16 bits. Not binary compatible, but 8080 assembly code could be assembled to 8086 code. 8088 version (used an eight bit bus) used as CPU in original IBM PC. In spite of many, many rumors to the effect that IBM would have preferred Motorola's MC68000 processor but Motorola couldn't commit to the volumes required, I haven't been able to find any verification.
1982  286                Addition of true segmented memory scheme. Couldn't switch between memory modes without a reboot, so new memory management wasn't used by Microsoft. Consequently, 286 was pretty much used only as a fast 8086.
1985  386 (P3)           First IA32 Implementation
1989  486 (P4)           Pipelined
1993  Pentium (P5)       Dual Pipelines
1995  Pentium Pro (P6)   Micro-ops, Out-of-order execution (Intel used this basic core for several generations of marketing, including the Pentium II and Pentium III)
2000  Pentium 4 (P7)     Lots of very clever enhancements...

Instruction Set

The Intel 386 instruction set (IA32) was designed as a 32-bit extension to the existing 8086 instruction set. The 8086 instruction set was in turn designed to make it possible to re-assemble 8080 programs easily, and the 8080 was a slightly tweaked 8008. Contrary to legend, though, the 8008 was not based on the 4004.

Due to its history, the philosophy behind the IA32 instruction set is very different from that of MIPS.

The instruction encodings are new (and a huge improvement over the 16 bit instruction set), but the 386 kept the existing eight registers, extended out to 32 bits. Of these eight registers, four are intended as more-or-less general purpose registers, and four are intended as pointer registers. Even though the assembly language gives the impression that you can use all of the registers almost interchangeably, down at the machine code level there are still vestiges of an accumulator machine. For example, if you perform an arithmetic operation with an immediate operand, the instruction can be one byte shorter if the destination register is the A register (also called the Accumulator).

Here's a picture showing the eight registers:

Intel Register Set

The easiest way to explain the registers is to describe the 8086 registers first, and then show how the IA32 registers were derived from them. The four sixteen bit "general purpose" 8086 registers were called AX, CX, DX, and BX (I list them in that order because that's how they're encoded in the machine code, for historical reasons going back to the 8080).

The four pointer registers are SP (the stack pointer), BP (the frame pointer, or "base" pointer in Intel terminology), SI and DI (used as the source and destination indexes for block moves). The limitations on the use of the pointer registers are more severe than those on the general purpose registers; for instance, you can't use indexed addressing using the stack pointer, nor use the base pointer without using indexed addressing.

For the 386, these registers were extended in the most straightforward way possible: when in 16 bit mode, the registers look like I just described. In 32 bit mode, you can still get at the byte-size pieces like before, but when you try to get the word-size version you get a 32-bit register instead.

The register set figure above tries to show how this works. Looking at the first register in the set, for instance, you can access the low-order eight bits as AL, the next eight bits as AH, the combination of AH and AL as AX, and the whole 32 bits as EAX.
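The overlapping register views can be sketched with a few masks and shifts. This is only an illustration of the bit layout just described; the function name `subregisters` is made up for this example and is not anything Intel defines.

```python
def subregisters(eax: int) -> dict:
    """Split a 32-bit EAX value into the overlapping views AL, AH, AX, EAX."""
    eax &= 0xFFFFFFFF            # keep only the 32 bits of the register
    ax = eax & 0xFFFF            # low-order 16 bits
    return {
        "eax": eax,              # the whole 32 bits
        "ax": ax,                # AH and AL together
        "ah": (ax >> 8) & 0xFF,  # bits 8..15
        "al": ax & 0xFF,         # bits 0..7
    }


views = subregisters(0x12345678)
print({name: hex(value) for name, value in views.items()})
```

Note that there is no name for the upper 16 bits of EAX on their own; the sketch reflects that by not returning one.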

There are also an Instruction Pointer (at least somebody calls the PC by the right name!), a flags register used to hold things like condition codes, and a number of other special-purpose registers.

Instructions can be Register-Register, Register-Memory, Memory-Register, or even Memory-Memory (though this last is somewhat rare, and won't turn up in the example later).

Most of the instructions are two bytes: one to specify the instruction, and a second to specify the operand. There may be more bytes after this if the first two bytes specify that more are needed. There may also be up to four prefix bytes before the main instruction to modify its behavior, for instance to use a 16 bit word when running in 32 bit mode, or to lock the bus for the duration of the instruction in multiprocessor applications. To take a concrete example, let's look at the ADD instruction. This will be the "general" ADD instruction; there are specialized forms with a different opcode for adding an immediate value to the accumulator, and for adding an immediate value to a location in memory. The instruction specifies two operands: one must be a register and the other can be either a register or a memory location. It will add the register to the other operand, putting the result either back in the register or in the other operand.

Op Code Byte

The ADD instruction takes the form
000000dw "mod r/m"
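where d is the direction bit (d=0 adds the reg operand into the "memory" operand; d=1 adds the "memory" operand into the reg operand) and w is the width bit (w=0 for byte operands, w=1 for full-size operands). The "memory" (in quotes because it isn't always actually from memory) operand, specified by the mod r/m byte, is of the general form disp + r1 + scale*r2, where any of those fields can actually be missing. "disp" is a constant displacement, "scale" is a scale factor of 1, 2, 4, or 8, and r1 and r2 are registers. We'll talk about how this is used later.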

The mod r/m byte contains three fields referred to as mod, reg, and r/m. They are encoded as

mod reg r/m
where mod (two bits) is the addressing mode, reg (three bits) is a register number to be used for source (or destination, depending on the d bit in the opcode byte), and r/m (three bits) either specifies a register number or is used in conjunction with the s-i-b byte (to be described later) to further specify the addressing mode.
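As a rough illustration of the two byte layouts, here is a sketch that pulls the fields out of an opcode byte of the form 000000dw and out of a mod r/m byte. The function names are invented for this example, and it assumes the usual reading of d as the direction bit and w as the width bit.

```python
def split_add_opcode(opcode: int) -> tuple:
    """Return (d, w) for an opcode byte of the form 000000dw."""
    if opcode & 0xFC != 0x00:
        raise ValueError("not the general ADD form 000000dw")
    return (opcode >> 1) & 1, opcode & 1


def split_modrm(byte: int) -> tuple:
    """Return (mod, reg, rm) from a mod r/m byte: 2 bits, 3 bits, 3 bits."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


# The two-byte sequence 01 D8 is a register-to-register ADD.
d, w = split_add_opcode(0x01)
mod, reg, rm = split_modrm(0xD8)
print(d, w, mod, reg, rm)
```

For the bytes 01 D8 the fields come out as d=0, w=1, mod=11, reg=011, r/m=000. With the 32 bit register encoding that is an add of ebx into eax, which gcc would write as addl %ebx,%eax.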

The 3-bit reg field of the mod r/m byte specifies one of the eight available registers, according to the following encoding:

reg   w=0   w=1 (16 bit)   w=1 (32 bit)
000   al    ax             eax
001   cl    cx             ecx
010   dl    dx             edx
011   bl    bx             ebx
100   ah    sp             esp
101   ch    bp             ebp
110   dh    si             esi
111   bh    di             edi
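The register encoding is easy to capture as a lookup table. This is a small sketch; `REG_NAMES` and `reg_name` are names chosen here for illustration, and the `operand_bits` parameter stands in for whatever selects 16 or 32 bit operands (the mode, possibly modified by a prefix).

```python
# Each entry is (byte register, 16 bit register, 32 bit register).
REG_NAMES = {
    0b000: ("al", "ax", "eax"),
    0b001: ("cl", "cx", "ecx"),
    0b010: ("dl", "dx", "edx"),
    0b011: ("bl", "bx", "ebx"),
    0b100: ("ah", "sp", "esp"),
    0b101: ("ch", "bp", "ebp"),
    0b110: ("dh", "si", "esi"),
    0b111: ("bh", "di", "edi"),
}


def reg_name(reg: int, w: int, operand_bits: int = 32) -> str:
    """Name the register selected by a 3-bit reg field and the w bit."""
    byte_reg, word_reg, dword_reg = REG_NAMES[reg & 0b111]
    if w == 0:
        return byte_reg
    return word_reg if operand_bits == 16 else dword_reg


print(reg_name(0b100, 0), reg_name(0b100, 1), reg_name(0b100, 1, 16))
```

The example prints the three readings of encoding 100, which shows the oddity in the table: the same bit pattern means ah for byte operands but the stack pointer for word-size operands.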

The mod and r/m fields specify the addressing mode for the memory operand.

The five mod r/m bits can specify combinations of addressing modes and registers. The possibilities are different depending on whether we're doing 16 bit or 32 bit effective addresses (as a shorthand, I've been describing the machine as having 16-bit or 32-bit mode; actually, the addressing can be 16 or 32 bits, and the operands can be 16 or 32 bits, separately); I'll describe the 32 bit addressing forms.

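The two mod bits actually specify the addressing mode:

00   register indirect, no displacement (except for esp and ebp)
01   register indirect plus an eight bit displacement (except for esp)
10   register indirect plus a 32 bit displacement (except for esp)
11   the operand is the register itself, not memory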

The r/m bits specify the register using the same encoding as in the reg field, with two exceptions that select a different addressing mode instead. Trying to specify ebp with mode 00 specifies that there is a 32 bit direct address following the instruction. Trying to specify esp in mode 00, 01, or 10 says we have another address mode byte present, called the s-i-b (scale-index-base) byte. In this case, we use the mod bits from the mod r/m byte together with the base, scale, and index fields from the s-i-b byte to construct the address, as follows.
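A sketch of how the mod and r/m fields combine for 32 bit addressing, including the ebp and esp special cases, might look like this. The function name and the strings it returns are made up for the example; only the bit patterns come from the encoding itself.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def addressing_form(mod: int, rm: int) -> str:
    """Describe the operand selected by the mod and r/m fields (32 bit addressing)."""
    if mod == 0b11:
        return REG32[rm]                 # operand is the register itself
    if rm == 0b100:
        return "s-i-b byte follows"      # the esp encoding is taken over
    if mod == 0b00 and rm == 0b101:
        return "[disp32]"                # the ebp encoding is taken over
    suffix = {0b00: "", 0b01: " + disp8", 0b10: " + disp32"}[mod]
    return f"[{REG32[rm]}{suffix}]"


for mod, rm in [(0b00, 0b011), (0b01, 0b101), (0b00, 0b101), (0b10, 0b100)]:
    print(f"{mod:02b} {rm:03b} -> {addressing_form(mod, rm)}")
```

An operand such as -4(%ebp), with its small offset from the frame pointer, would use mod 01 with r/m 101, that is, ebp plus an eight bit displacement.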

The two scale bits specify a scaling factor of 1, 2, 4, or 8 (encoded as 00, 01, 10, and 11 respectively).

The base and index fields in the s-i-b byte specify the base and index register, using the same encoding as in the mod r/m byte.

When an s-i-b byte is used, the mod bits in the mod r/m byte are used to specify one of the following addressing modes:

00   no displacement
01   eight bit displacement
10   32 bit displacement

The index register bits use the same encoding as before, except you can't specify the stack pointer as an index register (trying to do so specifies "no index register").

So, if the s-i-b byte is used, the address arithmetic ends up looking like the following figure:

addr = base + scale*index + displacement

The intent is that the base register will be a pointer to a record of some sort, such as an activation record (in which case the base register will be bp) or a struct. The displacement is the distance in the record to the start of an array (the three modes available through the mod bits in the mod r/m byte let you use no displacement at all, an eight bit displacement, or a 32 bit displacement, for code compaction). The index is the array index, and the scale is the size of objects within the array.
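To make the record-plus-array picture concrete, here is a minimal sketch of the address arithmetic. The function `effective_address`, the register values, and the record layout in the example are all hypothetical; the only things taken from the text are the allowed scale factors and the rule that the stack pointer cannot be an index register.

```python
def effective_address(regs: dict, base=None, index=None, scale: int = 1, disp: int = 0) -> int:
    """Compute base + scale*index + displacement, wrapped to 32 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    if index == "esp":
        raise ValueError("the stack pointer cannot be used as an index register")
    addr = disp
    if base is not None:
        addr += regs[base]
    if index is not None:
        addr += scale * regs[index]
    return addr & 0xFFFFFFFF


# Hypothetical frame: an array of 4-byte ints starts 40 bytes below ebp,
# and esi holds the array index 3.
regs = {"ebp": 0x1000, "esi": 3}
print(hex(effective_address(regs, base="ebp", index="esi", scale=4, disp=-40)))
```

Here ebp plays the role of the pointer to the activation record, -40 is the displacement to the start of the array, esi is the array index, and 4 is the element size.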

On a 386, a register-register (RR) add takes 2 cycles. A register-memory (RM) add takes 7.

486 Pipeline

The Intel instruction set was widely regarded as being un-pipeline-able (if that's a word -- and even if it's not). Then, Intel succeeded in pipelining it when they introduced the 486 in 1989.

Due to the small number of registers, RM and MR instructions will be extremely common. This affects the pipeline:

  1. PF: fetch into 32 byte prefetch buffer. Since the instructions are variable length, it isn't possible to keep fetching a fixed-size chunk for passing down the rest of the pipeline. Instead, the PF stage implements a queue: As instructions are taken from PF and sent down the pipe, PF sends out memory requests to refill the queue.
  2. D1: decode, obtain registers
  3. D2: construct address. For an instruction that does complex addressing, it may be necessary to spend two cycles in this stage (we can infer that there is a single address calculation ALU, and we may need to use it twice).
  4. EX: access cache to obtain operand, execute, and store result back to registers if needed. If the instruction has to do both a cache access and an ALU operation, it may spend two cycles here.
  5. WB: Writeback to cache

Also, some instructions may spend several cycles in a given stage (causing a stall, of course).

Pentium Pipelines

Here's a site with some good Intel pipeline information (also a lot of other stuff...). Also, the Intel Architecture Optimization Manual tells a lot about how to optimize code for the Pentium and its descendants.

The Pentium has two integer pipelines and one floating point pipeline (we'll only consider the integer pipes). The integer pipes each look pretty much just like the 486 pipe (though address generation can always be done in just one cycle in the Pentium); one is called U and one V. The processor attempts to find pairs of instructions, which it sends down the pipes together, with the first instruction going down the U pipe and the second down the V. There's a long list of rules on when this is possible... The Pentium also does dynamic branch prediction.

Here's a description of the Pentium pipelines:

Stall Rules

First, there is an interaction when either pipe stalls. If the U pipe stalls, so does the V. If the V pipe stalls, the instruction in the U pipe can continue to the WB stage. But, no new instruction can enter either pipe until the V pipe continues.

Instructions That Can Only Be Handled by One Pipe

While most instructions can be handled by either pipeline, there are a number of them that are restricted to one pipe, as follows:

U Only
shift/rotate, prefixed instructions, ADC, SBB
V Only
JMP/CALL/Jcc

Pairing Rules

Two instructions can only go down the pipes on the same cycle under certain circumstances:

  1. both must be "simple." In general, simple instructions are what you'd expect: ADD, SUB, MOV, INC, DEC, etc. Jcc is a simple instruction. ENTER is not.
  2. neither instruction can contain both a displacement and an immediate operand (got to make things a *little* easier for the decision logic!)
  3. no RAW or WAW dependencies between the instructions except for some special cases involving the flags register and the stack pointer:
    1. you can pair a CMP or TEST with a Jcc
    2. you can pair two PUSHes or two POPs
  4. there are additional rules that mainly affect interactions between the CPU and the cache, but they are relatively unimportant, especially since we haven't talked about cache yet.
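The first three pairing rules can be sketched as a predicate. This is a deliberately simplified model, not the Pentium's actual decision logic: the `Instr` record, the `SIMPLE` set, and the convention of treating the flags as a register named "flags" are all assumptions made for this example, and the U-only/V-only restrictions and the cache-related rules are left out.

```python
from dataclasses import dataclass

# Assumed set of "simple" mnemonics, taken from the examples in the rules above.
SIMPLE = {"add", "sub", "mov", "inc", "dec", "cmp", "test", "push", "pop", "jcc"}


@dataclass
class Instr:
    op: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()
    has_disp: bool = False
    has_imm: bool = False


def can_pair(u: Instr, v: Instr) -> bool:
    """Apply rules 1-3: u would go down the U pipe, v down the V pipe."""
    if u.op not in SIMPLE or v.op not in SIMPLE:
        return False                                  # rule 1
    if (u.has_disp and u.has_imm) or (v.has_disp and v.has_imm):
        return False                                  # rule 2
    if u.op in ("cmp", "test") and v.op == "jcc":
        return True                                   # rule 3, flags special case
    if u.op == v.op and u.op in ("push", "pop"):
        return True                                   # rule 3, stack pointer special case
    raw = u.writes & v.reads
    waw = u.writes & v.writes
    return not (raw or waw)                           # rule 3, general case


inc_eax = Instr("inc", frozenset({"eax"}), frozenset({"eax", "flags"}))
cmp_eax = Instr("cmp", frozenset({"eax"}), frozenset({"flags"}), has_imm=True)
cmp_mem = Instr("cmp", frozenset({"ebp"}), frozenset({"flags"}), has_disp=True, has_imm=True)
jle = Instr("jcc", frozenset({"flags"}))

print(can_pair(inc_eax, cmp_eax), can_pair(cmp_eax, jle), can_pair(cmp_mem, jle))
```

Under this model an incl %eax cannot pair with a following cmpl on %eax (RAW on eax), a register cmpl can pair with a Jcc, and a cmpl that carries both a displacement and an immediate cannot pair with anything.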

Branch Prediction Logic

The Pentium uses a 256-entry branch target buffer, with a 4-way set-associative organization (which will make more sense after we talk about cache). It uses a two-bit scheme, though I've seen some controversy regarding exactly what state machine it implements.
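For readers who haven't seen a two-bit scheme, here is the textbook saturating-counter version. Given the controversy mentioned above, this should be read as the generic scheme and not as a claim about the state machine the Pentium actually implements; the class and function names are invented for the example.

```python
class TwoBitCounter:
    """Textbook two-bit saturating counter: states 0..3, predict taken when >= 2."""

    def __init__(self, state: int = 0):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward 3 on a taken branch, one step toward 0 otherwise.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)


def mispredictions(outcomes, state: int = 0) -> int:
    """Count wrong predictions for one branch over a sequence of outcomes."""
    counter = TwoBitCounter(state)
    wrong = 0
    for taken in outcomes:
        if counter.predict() != taken:
            wrong += 1
        counter.update(taken)
    return wrong


# A loop branch taken nine times and then falling through once.
print(mispredictions([True] * 9 + [False]))
```

Starting from the "strongly not taken" state, the counter needs two taken branches to warm up and then misses once more at loop exit, so a long-running loop branch is predicted correctly almost all of the time.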

So, how does this affect optimizing our toy program for an old fashioned Pentium?

(no optimization)
Since we go through the loop a billion times, only the loop is important. So the part of the code we want to look at is:
.L2:
	cmpl $999999999,-4(%ebp)
	jle .L5
	jmp .L3
	.align 16
.L5:
.L4:
	incl -4(%ebp)
	jmp .L2
	.align 16
.L3:
	  

The loop contains four instructions (since the jle is taken on every pass, the jmp .L3 is never executed and doesn't count). The required time was 45.22 seconds; at 133 MHz, this was six cycles per loop for a CPI of 1.5. There must be some bad stalling or misprediction going on....

-O1
.L4:
        incl %eax
        cmpl $999999999,%eax
        jle .L4
	  

Time: 15.37 seconds. One billion loop iterations at 133 MHz works out to 2.044 cycles/iteration. This is consistent with the incl going through on the first cycle, and the cmpl and jle pairing on the second.

-O2
.L4:
        decl %eax
        jns .L4
	  

Time: 37.66 seconds. One billion loop iterations at 133 MHz works out to 5.009 cycles/iteration, or about 2.5 CPI for this two-instruction loop...
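The cycle counts for the three builds all come from the same arithmetic: elapsed seconds times clock rate, divided by the number of iterations. A small sketch, assuming exactly one billion iterations and an exact 133 MHz clock as in the text (the function name is just for illustration):

```python
def cycles_per_iteration(seconds: float, clock_hz: float = 133e6,
                         iterations: int = 1_000_000_000) -> float:
    """Convert a measured run time into clock cycles per loop iteration."""
    return seconds * clock_hz / iterations


# (build, measured seconds, instructions executed per loop iteration)
runs = [("no optimization", 45.22, 4), ("-O1", 15.37, 3), ("-O2", 37.66, 2)]
for build, seconds, instructions in runs:
    cycles = cycles_per_iteration(seconds)
    print(f"{build}: {cycles:.3f} cycles/iteration, CPI {cycles / instructions:.2f}")
```

Dividing by the instruction count per iteration gives the CPI figures quoted above.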

If I put a nop either before the decl, or between the decl and the jns, it speeds up to match the speed of the -O1 code.

I still don't get it!

Last modified: Mon Mar 27 12:54:49 MST 2006