Paper from Intel on Pentium, Pentium MMX, and Pentium II Pipeline (PDF format)
Agner Fog's excellent information on optimizing for various Pentium processors
Toy "benchmark" program used to illustrate pipeline anomaly later:
Notes on benchmarks, including timing results for the toy code on a number of computers.
IA32 is the name used by Intel to describe their 32-bit instruction set architecture, as introduced in the Intel 386 processor (and substantially enhanced later). Some milestones in Intel history are in the following table (this table only mentions major new processor cores, not smaller enhancements like MMX instructions etc), excerpted from Intel's Processor Hall of Fame and with other information from various places.
| Year | Processor | Notes |
|------|-----------|-------|
| 1972 | 8008 | First true microcomputer (I don't count the 4004, though many other people do). Developed as an embedded controller for a computer terminal for Computer Terminal Corporation. However, due to CTC's production timing, the chip wasn't actually used by them; their Datapoint 2200 terminal used about 100 separate components instead of the new microprocessor chip. |
| 1974 | 8080 | Slightly "tweaked" improvement on the 8008. |
| 1978 | 8086 | Expansion to 16 bits. Not binary compatible, but 8080 assembly code could be assembled to 8086 code. The 8088 version (which used an eight-bit bus) was the CPU in the original IBM PC. In spite of many, many rumors to the effect that IBM would have preferred Motorola's MC68000 but Motorola couldn't commit to the required volumes, I haven't been able to find any verification. |
| 1982 | 286 | Addition of a true segmented memory scheme. It couldn't switch between memory modes without a reboot, so the new memory management wasn't used by Microsoft; consequently, the 286 was pretty much used only as a fast 8086. |
| 1985 | 386 (P3) | First IA32 implementation. |
| 1993 | Pentium (P5) | Dual pipelines. |
| 1995 | Pentium Pro (P6) | Micro-ops, out-of-order execution (Intel used this basic core for several generations of marketing, including the Pentium II and Pentium III). |
| 2000 | Pentium 4 (P7) | Lots of very clever enhancements... |
The Intel 386 instruction set (IA32) was designed as a 32-bit extension to the existing 8086 instruction set (which in turn was designed to make it possible to re-assemble 8080 programs easily, and the 8080 was a slightly tweaked 8008. But contrary to legend, the 8008 was not based on the 4004).
Due to its history, the philosophy behind the IA32 instruction set is very different from that of MIPS.
The instruction encodings are new (and a huge improvement over the 16-bit instruction set), but the architecture kept the existing eight registers, extended out to 32 bits. Of these eight registers, four are intended as more-or-less general purpose registers, and four are intended as pointer registers. Even though the assembly language gives the impression that you can use all of the registers almost interchangeably, down at the machine code level there are still vestiges of an accumulator machine; for example, an arithmetic operation with an immediate operand can be one byte shorter if the destination register is the A register (also called the Accumulator).
Here's a picture showing the eight registers:
The easiest way to explain the registers is to describe the 8086 registers first, and then show how the IA32 registers were derived from them. The four sixteen-bit "general purpose" 8086 registers were called AX, CX, DX, and BX (I list them in that order because that's how they're encoded in the machine code, for historical reasons going back to the 8080). The four pointer registers were SP (the stack pointer), BP (the frame, or "base" in Intel terminology, pointer), SI, and DI (used as the source and destination indexes for block moves). The limitations on the use of the pointer registers are more severe than those on the general purpose registers; for instance, you can't use indexed addressing with the stack pointer, nor use the base pointer without indexed addressing.
For the 386, these registers were extended in the most straightforward way possible: when in 16 bit mode, the registers look like I just described. In 32 bit mode, you can still get at the byte-size pieces like before, but when you try to get the word-size version you get a 32-bit register instead.
The register set figure above tries to show how this works. Looking at the first register in the set, for instance, you can access the low-order eight bits as AL, then next eight bits as AH, the combination of AH and AL as AX, and the whole 32 bits as EAX.
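This aliasing can be illustrated with a small sketch (the helper function name is mine, for illustration only), treating EAX as a 32-bit value and extracting the sub-register views by masking and shifting:

```python
# Sketch: how AL, AH, AX, and EAX alias the same 32-bit register,
# following the IA32 layout described above.

def sub_registers(eax):
    """Return the aliased views of a 32-bit EAX value."""
    ax = eax & 0xFFFF        # low 16 bits
    al = ax & 0xFF           # low byte of AX
    ah = (ax >> 8) & 0xFF    # high byte of AX
    return {"EAX": eax, "AX": ax, "AH": ah, "AL": al}

views = sub_registers(0x12345678)
print({name: hex(value) for name, value in views.items()})
# {'EAX': '0x12345678', 'AX': '0x5678', 'AH': '0x56', 'AL': '0x78'}
```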
There are also an Instruction Pointer (at least somebody calls the PC by the right name!), a flags register used to hold things like condition codes, and a number of other special-purpose registers.
Instructions can be Register-Register, Register-Memory, Memory-Register, or even Memory-Memory (though this last is somewhat rare, and won't turn up in the example later).
Most of the instructions are two bytes: one to specify the instruction, and a second to specify the operand (there may be more bytes after this if the first two bytes specify that more are needed. Also, there may be up to four prefix bytes before the main instruction to modify its behavior; for instance for using a 16 bit word when running in 32 bit mode or to lock the bus for the duration of the instruction for multiprocessor applications). To take a concrete example, let's look at the ADD instruction (note that this will be the "general" ADD instruction; there are specialized, shorter forms with a different opcode for adding an immediate value to the accumulator, and adding an immediate value to a location in memory). The instruction specifies two operands; one must be a register and the other can be virtually anything: a register, an immediate value, or a memory location. It will add the register to the other operand, putting the result either back in the register or in the other operand.
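As a sketch of how those two bytes fit together (the helper function and register table here are mine, not Intel's; the bit layout follows the description in the text), the general register-register ADD can be assembled like this:

```python
# Sketch: encode a 32-bit register-register ADD (opcode 000000dw, then mod r/m).
# d=0 means the reg field is the source; w=1 means full-width operands;
# mod=11 means "register" addressing. The register numbering follows the
# reg-field encoding table in the text.

REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def encode_add_rr(dst, src):
    """Encode 'add dst, src' (both 32-bit registers) as two bytes."""
    opcode = 0b00000001                                   # d=0, w=1
    modrm = (0b11 << 6) | (REG32[src] << 3) | REG32[dst]  # mod=11: reg-reg
    return bytes([opcode, modrm])

print(encode_add_rr("ebx", "eax").hex())  # 01c3 -> "add %eax, %ebx"
```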
The encoding is

    000000dw mod-r/m

where the d bit selects whether the reg field is the source or the destination, and the w bit selects byte (w=0) or full-width (w=1) operands. The mod r/m byte contains three fields, referred to as mod, reg, and r/m. They are laid out as

    mod reg r/m

- mod (two bits) is the addressing mode,
- reg (three bits) is a register number to be used for source (or destination, depending on the d bit in the opcode byte), and
- r/m (three bits) either specifies a register number or is used in conjunction with the s-i-b byte (to be described later) to further specify the addressing mode.
The reg field of the mod r/m byte specifies one of the eight available registers, according to the following encoding:

| reg | w=0 | w=1, 16 bit | w=1, 32 bit |
|-----|-----|-------------|-------------|
| 000 | al  | ax | eax |
| 001 | cl  | cx | ecx |
| 010 | dl  | dx | edx |
| 011 | bl  | bx | ebx |
| 100 | ah  | sp | esp |
| 101 | ch  | bp | ebp |
| 110 | dh  | si | esi |
| 111 | bh  | di | edi |
The mod and r/m fields specify the addressing mode for the memory operand.
The five mod r/m bits can specify combinations of addressing modes and registers. The possibilities are different depending on whether we're doing 16 bit or 32 bit effective addresses (as a shorthand, I've been describing the machine as having 16-bit or 32-bit mode; actually, the addressing can be 16 or 32 bits, and the operands can be 16 or 32 bits, separately); I'll describe the 32 bit addressing forms.
The two mod bits actually specify the addressing mode:
The r/m bits specify the register using the same encoding as in the reg field, except that trying to use one of the registers marked as "except" in the list above results in a different addressing mode. Specifying ebp with mode 00 instead means that a 32-bit direct address follows the instruction. Specifying esp in mode 00, 01, or 10 means we have another address mode byte present, called the s-i-b (scale-index-base) byte. In this case, we use the mod bits from the mod r/m byte together with the base, scale, and index fields from the s-i-b byte to construct the address, as follows.
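The special cases can be made concrete with a toy decoder sketch (my own illustration; a real decoder handles many more cases):

```python
# Sketch: how a decoder interprets the mod and r/m fields for 32-bit
# addressing, including the ebp and esp special cases described above.

def addressing_form(mod, rm):
    regs = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
    if mod == 0b11:
        return f"register {regs[rm]}"
    if rm == 0b100:                      # "esp" slot: s-i-b byte follows
        return "s-i-b byte follows"
    if mod == 0b00:
        if rm == 0b101:                  # "ebp" slot: direct address instead
            return "32-bit direct address follows"
        return f"[{regs[rm]}]"
    disp = "8-bit" if mod == 0b01 else "32-bit"
    return f"[{regs[rm]} + {disp} displacement]"

print(addressing_form(0b00, 0b101))  # 32-bit direct address follows
print(addressing_form(0b01, 0b100))  # s-i-b byte follows
```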
The two scale bits specify a scaling factor of 1, 2, 4, or 8 (the factor is 2 raised to the two-bit scale value).
The base and index fields in the s-i-b byte specify the base and index register, using the same encoding as in the mod r/m byte.
When an s-i-b byte is used, the mod bits in the mod r/m byte are used to specify one of the following addressing modes:
So, if the s-i-b byte is used, the address arithmetic ends up looking like the following figure:
The intent is that the base register will be a pointer to a record of some sort, such as an activation record (in which case the base register will be bp) or a struct. The displacement is the distance in the record to the start of an array (the three modes available through the mod bits in the mod r/m byte let you use no displacement at all, an eight bit displacement, or a 32 bit displacement, for code compaction). The index is the array index, and the scale is the size of objects within the array.
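Putting the pieces together, the s-i-b effective-address arithmetic can be sketched as follows (a toy model; the values in the example are made up):

```python
# Sketch: effective address computed from an s-i-b byte plus displacement:
#     address = base + index * 2**scale + displacement
# e.g. a struct pointer in the base register, an array starting
# 'displacement' bytes into the struct, indexed by element number,
# with the scale matching the element size.

def effective_address(base, index, scale, displacement):
    return base + index * (1 << scale) + displacement

# Base register points at a record; an array of 32-bit ints starts 16 bytes
# in (scale = 2 -> factor of 4); we want element 5:
addr = effective_address(base=0x1000, index=5, scale=2, displacement=16)
print(hex(addr))  # 0x1000 + 5*4 + 16 = 0x1024
```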
A register-register (RR) add on a 386 took 2 cycles; a register-memory (RM) add took 7.
The Intel instruction set was widely regarded as being un-pipeline-able (if that's a word -- and even if it's not). Then, Intel succeeded in pipelining it when they introduced the 486 in 1989.
Due to the small number of registers, RM and MR instructions will be extremely common. This affects the pipeline:
Also, some instructions may spend several cycles in a given stage (causing a stall, of course).
Here's a site with some good Intel pipeline information (also a lot of other stuff...). Also, the Intel Architecture Optimization Manual tells a lot about how to optimize code for the Pentium and its descendants.
The Pentium has two integer and one floating point pipeline (we'll only consider the integer pipes). The integer pipes each look pretty much just like the 486 pipe (though address generation can always be done in just one cycle in the Pentium); one is called U and one V. The processor attempts to find pairs of instructions, which it sends down the pipes together, with the first instruction going down the U pipe and the second down the V. There's a long list of rules on when this is possible... The Pentium also does dynamic branch prediction.
Here's a description of the Pentium pipelines:
First, there is an interaction when either pipe stalls. If the U pipe stalls, so does the V. If the V pipe stalls, the instruction in the U pipe can continue to the WB stage. But, no new instruction can enter either pipe until the V pipe continues.
While most instructions can be handled by either pipeline, there are a number of them that are restricted to one pipe, as follows:
Two instructions can only go down the pipes on the same cycle under certain circumstances:
The Pentium uses a 256-entry branch target buffer, with a 4-way set-associative organization (which will make more sense after we talk about cache). It uses a two-bit scheme, though I've seen some controversy regarding exactly what state machine it implements.
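One common two-bit scheme, the saturating counter, can be sketched as follows (whether the Pentium implements exactly this state machine is, as noted, disputed; this is just the textbook version):

```python
# Sketch: 2-bit saturating-counter branch predictor.
# States 0,1 predict not-taken; states 2,3 predict taken.
# Each actual outcome nudges the counter one step toward that outcome.

class TwoBitPredictor:
    def __init__(self):
        self.state = 0  # strongly not-taken

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch (taken nine times, then not-taken on exit) mispredicts
# twice while warming up and once on the exit:
p = TwoBitPredictor()
mispredicts = 0
for taken in [True] * 9 + [False]:
    if p.predict() != taken:
        mispredicts += 1
    p.update(taken)
print(mispredicts)  # 3
```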
So, how does this affect optimizing our toy program for an old fashioned Pentium?
.L2:
        cmpl $999999999,-4(%ebp)
        jle .L5
        jmp .L3
        .align 16
.L5:
.L4:
        incl -4(%ebp)
        jmp .L2
        .align 16
.L3:
The loop contains four instructions (since the jle is taken on every pass, the jmp .L3 that follows it never executes and doesn't count).
The required time was 45.22 seconds; at 133 MHz, this is six cycles per loop, for a CPI of 1.5. There must be some bad stalling or misprediction going on....
.L4:
        incl %eax
        cmpl $999999999,%eax
        jle .L4
Time: 15.37 seconds. 1 billion loop iterations at 133 MHz -> 2.044 cycles/iteration. This is consistent with the incl going through on the first cycle, and the cmpl and jle pairing on the second.
.L4:
        decl %eax
        jns .L4
Time: 37.66 seconds. 1 billion loop iterations at 133 MHz -> 5.008 cycles/iteration. This works out to 2.5 CPI...
If I put a nop either before the decl, or between the decl and the jns, it speeds up to match the speed of the -O1 code.
I still don't get it!