The Itanium (known as Merced while still under development) is Intel's extension into a 64-bit architecture. While AMD is attempting a straightforward extension of the old IA-32 instruction set (the product is known as Sledgehammer), Intel and HP are trying for Something Completely Different. This looks like a very innovative architecture...
One of the facts of life of computers is that, as applications get larger, the computer needs a larger word size. It's hard to imagine that the 4GB per-process space made available by a 32-bit address isn't adequate, but... it isn't. When we consider applications like databases, a larger address space is needed. Consequently, the industry has been in the process of changing to 64-bit architectures for roughly the last decade. The late lamented DEC Alpha was a 64-bit architecture, and there are 64-bit extensions to MIPS and Sparc.
When a company changes the word size of its architecture, it's an opportunity to make changes in the instruction set as well. In Motorola's case, the 6800 family and 68000 family had almost nothing in common; they saw the two architectures as aimed at completely different markets and didn't see a way to lock users in.
Intel has previously made two changes in their word size, and consequently their instruction set. In both cases they've worked hard to maintain their existing customer base, so they've placed a heavy emphasis on backward compatibility: when they went from 8-bit (8080 and 8085) to 16-bit (8086) they used a similar enough instruction set that it was possible to cross-assemble 8080 code for the 8086, and when they went from 16-bit (8086 and 80286) to 32-bit (386) they introduced a backward-compatibility mode so the new processor could actually run machine code for the old processor. In both these cases the amount of hardware they could use in a processor was a severe constraint, and dictated that the new register and instruction sets would closely imitate the old.
Transistor budgets today are huge by almost any standard, so in going to 64-bit Intel has the freedom to use a radically different instruction set and register set, and yet put in a backward compatibility mode that fully emulates IA-32.
Unfortunately, Intel has stumbled quite badly in getting new product out, and they're still working on getting Itanium in production quantities.
Main points:
One of the big differences between IA-64 and other architectures is that instruction-level parallelism is explicitly represented in the instruction set. Intel calls this concept EPIC (Explicitly Parallel Instruction Computing). It's quite similar to an old research topic called Very Long Instruction Word computing; some critics claim it's really the same idea under a new label, but to my mind there is enough difference to warrant the new name.
The main point here is that on a machine like the CDC or the Intel P6, a programmer (or the compiler; from our perspective it doesn't really matter) writes a fully sequential program and relies on the hardware to notice when instructions can really be executed in parallel. The point of EPIC is that it should be possible for the compiler to be brighter than the hardware, so it should be possible for the code to specify what can happen in parallel.
A great deal of explicit instruction level parallelism is built into the concept of instruction groups. An instruction group is a bunch of instructions that is conceptually executed in parallel: in effect, the compiler collects up a bunch of instructions without dependencies into a ``group.'' A group ends when there is either a taken branch or a ``stop'' in the code (written as ;; in the assembly language). There is no limit in the architecture as to how many instructions can be in a group.
Conceptually, every instruction in one group must execute before any instruction in the next group, and every instruction in a given group executes simultaneously.
Intel's wording of this is to say that ``every instruction in a given instruction group will behave as though its (register) read occurred after the (state) update of all the instructions from the previous instruction group'' and ``(w)ithin an instruction group, every instruction will behave as though its read of the register state occurred before the update of the register state by any instruction (prior or later) in the instruction group.'' There are a few exceptions to both of these rules, which we don't need to worry about here. Incidentally, the phrase ``as though'' is extremely important: it means you don't really need an infinitely-ported register file; you can actually spend several cycles reading data from registers, and the instructions in the group don't all have to really be executed simultaneously.
Here's an example. Let's say we have the following little bit of code (we'll assume the variables are all in registers):
a = (b + c) * (d + e);
We can translate this into two groups:
add r1=r2,r3
add r4=r5,r6;;
mul r7=r1,r4
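To make the ``as though'' rule concrete, here is a minimal sketch in Python (not IA-64 code) of the two groups above. It assumes a toy machine of my own invention: registers are dictionary entries, an instruction is a (dest, op, src1, src2) tuple, and the names run_group and run_program are hypothetical. Every instruction in a group reads the register state left by the previous group, and only then are the results written back.

```python
def run_group(regs, group):
    """Execute one instruction group with read-before-write semantics.

    Every source register is read from the state as it was when the group
    started; destinations are updated only after all reads are done.
    """
    ops = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}
    snapshot = dict(regs)  # state left behind by the previous group
    results = {dest: ops[op](snapshot[a], snapshot[b])
               for dest, op, a, b in group}
    regs.update(results)
    return regs


def run_program(regs, groups):
    """Run the groups one after another; a stop (;;) separates them."""
    for group in groups:
        run_group(regs, group)
    return regs


def fresh_regs():
    # b, c, d, e live in r2, r3, r5, r6 with the values 2, 3, 4, 5
    return {"r1": 0, "r2": 2, "r3": 3, "r4": 0, "r5": 4, "r6": 5, "r7": 0}


group1 = [("r1", "add", "r2", "r3"), ("r4", "add", "r5", "r6")]  # ends with ;;
group2 = [("r7", "mul", "r1", "r4")]

regs = run_program(fresh_regs(), [group1, group2])
print(regs["r7"])  # 45, that is (2 + 3) * (4 + 5)

# Without the stop, the mul would read r1 and r4 before the adds update them.
stale = run_group(fresh_regs(), group1 + group2)
print(stale["r7"])  # 0
```

The second call shows why the stop is needed: with all three instructions in one group, the multiply sees the old contents of r1 and r4. The sketch simply computes with the stale values to make that visible; it does not model how the real architecture treats such a dependency inside a group.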
Itanium has 128 64-bit integer registers, 128 82-bit floating point registers, 64 1-bit predicate registers, and 8 branch registers.
We'll get into the predicate and branch registers later; right now, let's talk about the integer registers and the register stacks.
The register stack is very similar to a mechanism used in the original RISC, and carried on into the Sparc. I hadn't seen any new architectures with the feature in a while, so I had assumed it was dead...
The basic idea is that a register called the Current Frame Marker (CFM) controls some register renaming. Registers GR0 to GR31 are always visible under those names, but the CFM decides where the programmer's idea of register 32 is in the real register set. An alloc instruction is used to move this pointer and allocate new registers to a program; another instruction is used to deallocate the registers.
The point of the mechanism is to avoid unnecessary saves and restores of registers on procedure call and return. Most of the time, a program can just allocate and free registers as needed. The register stack is backed by memory, so if the system runs out of registers (stack overflow) the extras are ``spilled'' into memory, and if a stack underflow occurs they are retrieved from memory.
This mechanism also provides a more efficient way to pass parameters to procedures: pass them on the register stack.
Finally, you can make use of ``register rotation'' to allocate new registers to each iteration of a loop, providing a more compact version of loop unrolling.
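The renaming and the parameter passing can be sketched with a toy model. This is Python, not a description of the real hardware: the class RegisterStack and its methods are hypothetical names, the physical register file is allowed to grow without limit (so spilling and register rotation are left out), and a call is modeled by sliding the frame so that the callee's register 32 sits on top of the caller's first outgoing argument.

```python
class RegisterStack:
    """Toy model of frame-relative renaming for registers 32 and up."""

    NUM_STATIC = 32  # GR0..GR31 are always visible under their own names

    def __init__(self):
        self.static = [0] * self.NUM_STATIC
        self.stacked = []  # toy simplification: grows without limit
        self.base = 0      # physical slot holding the current frame's r32
        self.size = 0      # number of stacked registers in the frame
        self.saved = []    # caller frames, restored on return

    def alloc(self, size):
        """Resize the current frame (hypothetical stand-in for alloc)."""
        self.size = size
        needed = self.base + size - len(self.stacked)
        self.stacked.extend([0] * max(0, needed))

    def call(self, first_output):
        """Enter a callee whose r32 is the caller's register first_output."""
        self.saved.append((self.base, self.size))
        new_base = self.base + (first_output - self.NUM_STATIC)
        self.size = self.base + self.size - new_base  # outputs stay visible
        self.base = new_base

    def ret(self):
        """Return to the caller by restoring its frame."""
        self.base, self.size = self.saved.pop()

    def _slot(self, n):
        if not self.NUM_STATIC <= n < self.NUM_STATIC + self.size:
            raise IndexError(f"r{n} is outside the current frame")
        return self.base + (n - self.NUM_STATIC)

    def read(self, n):
        if n < self.NUM_STATIC:
            return 0 if n == 0 else self.static[n]  # GR0 always reads as 0
        return self.stacked[self._slot(n)]

    def write(self, n, value):
        if n < self.NUM_STATIC:
            if n != 0:
                self.static[n] = value
        else:
            self.stacked[self._slot(n)] = value


rs = RegisterStack()
rs.alloc(4)               # caller frame: r32..r35
rs.write(32, 111)         # a caller local
rs.write(34, 7)           # first outgoing argument
rs.call(first_output=34)
rs.alloc(3)               # callee frame: r32..r34
arg = rs.read(32)         # the callee sees the argument as its own r32
rs.write(34, 999)         # a callee local in a fresh physical slot
rs.ret()
caller_local = rs.read(32)
print(arg, caller_local)  # 7 111
```

Nothing is copied on the call or the return: the argument is passed just by changing which physical slot the name r32 refers to, and the caller's local survives because the callee never had a name for it.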
One other cute thing the integer registers all have is a NaT (Not a Thing) bit, identifying whether the register contents are valid. This can be used to detect use of uninitialized values. Oh, yes, GR0 always contains 0. Over on the FP side, FR0 always contains 0.0, and FR1 always contains 1.0.
Predication is a mechanism that gets rid of branches. What happens is that, instead of performing a branch, we can put the result of a comparison into a predicate register. Then, we can use the contents of the predicate register to determine whether to execute a later instruction. Again, from Intel:
Before
if (a > b) c = c + 1
else d = d + e + f
After
pT, pF = compare(a>b)
if (pT) c = c + 1
if (pF) d = d + e + f
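Here is the same example as a small Python sketch, just to check that the two forms agree. The function names branching and predicated are mine, and the conditional expressions only model the effect of a predicate register: both operations are issued, and the predicate decides whether the result is kept.

```python
def branching(a, b, c, d, e, f):
    """The 'Before' version: an ordinary if/else."""
    if a > b:
        c = c + 1
    else:
        d = d + e + f
    return c, d


def predicated(a, b, c, d, e, f):
    """The 'After' version: one compare, then two guarded operations."""
    # One compare writes a complementary pair of predicates.
    p_true = a > b
    p_false = not p_true
    # Both operations appear in straight-line code; each result is kept
    # only if its predicate is set.
    c = (c + 1) if p_true else c
    d = (d + e + f) if p_false else d
    return c, d


print(predicated(5, 3, 10, 20, 1, 2))  # (11, 20)
print(predicated(3, 5, 10, 20, 1, 2))  # (10, 23)
```

Since the two guarded operations don't depend on each other, a compiler is free to put them in the same instruction group.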
The processor makes heavy use of speculation, including both control and data speculation. Control speculation is the execution of an instruction before the branch that guards it. An example from Intel looks like this.
Original code:
if (a > b) load(ld_addr1, target1)
else load(ld_addr2, target2)
Speculative code:
/* off critical path */
sload(ld_addr1, target1)
sload(ld_addr2, target2)
/*
...other operations including uses of target1/target2...
*/
if (a > b) scheck(target1, recovery_addr1)
else scheck(target2, recovery_addr2)
The implementation of this scheme makes use of the NaT bits. If the speculative load fails, the NaT bit for the register is set and a code is placed in the register. The check instructions test the NaT bits and, if one is set, transfer control to the recovery code at the given address.
The NaT bits propagate through later instructions - if you use a register to do some arithmetic, and the register is NaT, then your result is also NaT.
Notice that this requires quite a bit of compiler support - instead of a traditional scheme in which only the load you want takes place, and you use the same register for both paths, in this case two registers get loaded and you use the appropriate one for each path.
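A minimal sketch of the NaT mechanism, in Python and under some loud assumptions: memory is a dictionary, a ``failed'' load is simply a load from an address that isn't in the dictionary, a NaT register is represented by a sentinel object, and recovery code is an ordinary function. The names sload and scheck follow the Intel example above; add and NAT are hypothetical.

```python
NAT = object()  # stand-in for a register whose NaT bit is set


def sload(memory, addr):
    """Speculative load: defer a failure by returning NaT instead of raising."""
    return memory.get(addr, NAT)


def add(x, y):
    """Arithmetic propagates NaT from either source to the result."""
    return NAT if x is NAT or y is NAT else x + y


def scheck(value, recovery):
    """Check: if the value is NaT, run the recovery code to recompute it."""
    return recovery() if value is NAT else value


memory = {0x10: 5}               # address 0x20 is deliberately missing

# Off the critical path: both loads are started early.
target1 = sload(memory, 0x10)
target2 = sload(memory, 0x20)    # fails quietly, target2 is NaT

# Other operations use the targets; NaT flows through them.
sum1 = add(target1, 1)
sum2 = add(target2, 1)

# Only the path actually taken checks its own target.
taken = scheck(sum1, recovery=lambda: memory[0x10] + 1)
not_taken = scheck(sum2, recovery=lambda: "recovered")
print(taken, sum2 is NAT, not_taken)  # 6 True recovered
```

The failed load costs nothing unless its path is actually taken, which is the whole point: the fault is deferred to the check.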
Itanium also makes use of data speculation: if a load is moved ahead of a store that may end up referring to the same address, a speculative load is performed instead and a check instruction is placed in the original location of the load.
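Data speculation can be sketched the same way. The model below is an assumption of mine rather than a description of the hardware: a hypothetical Machine class keeps a small table recording which address each early load read, a store to that address removes the entry, and the check at the load's original position reloads only when the entry is gone.

```python
class Machine:
    """Toy model of a load hoisted above a possibly conflicting store."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.tracked = {}  # register name -> address of its early load

    def load_early(self, reg, addr):
        """Load moved above a store: remember which address it read."""
        self.tracked[reg] = addr
        return self.memory[addr]

    def store(self, addr, value):
        """A store invalidates any early load from the same address."""
        self.memory[addr] = value
        self.tracked = {r: a for r, a in self.tracked.items() if a != addr}

    def check(self, reg, addr, value):
        """At the load's original position: reload only if invalidated."""
        if self.tracked.pop(reg, None) == addr:
            return value           # no conflicting store, early value is good
        return self.memory[addr]   # a store got in the way, load again


m = Machine({100: 1, 200: 2})

v = m.load_early("r8", 100)   # hoisted above the store below
m.store(200, 9)               # different address, no conflict
v = m.check("r8", 100, v)

w = m.load_early("r9", 200)   # hoisted above a store to the same address
m.store(200, 42)
w = m.check("r9", 200, w)

print(v, w)  # 1 42
```

In the common case the check finds the entry still there and the load has effectively been done for free ahead of the store.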
As for the actual implementation, instructions are combined into bundles (don't confuse bundles with groups! An instruction bundle is not an instruction group). Every instruction (except ``load long immediate'') is 41 bits; an instruction bundle includes three instructions and a five-bit template to make a total of 128 bits. Instructions are divided into a half dozen classes (as in arithmetic, memory, etc), and only certain combinations of instructions are possible in the various slots in the bundle (due to the length of the template there are only 32 possible combinations of instruction classes and stops in a bundle). It's possible for instructions in two classes to have identical encodings; the template specifies which instruction class occupies each slot in a bundle and also whether there are any stops dividing instruction groups in the bundle (and where they are if so).
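The arithmetic (3 x 41 + 5 = 128) is easy to check with a little bit-packing sketch in Python. The function names are hypothetical, and the bit positions are an assumption on my part (template in the low five bits, followed by slots 0, 1 and 2); the text above only fixes the field widths.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOTS_PER_BUNDLE = 3


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instructions into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for i, insn in enumerate(slots):
        if not 0 <= insn < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= insn << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into its template and three slots."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask
             for i in range(SLOTS_PER_BUNDLE)]
    return template, slots


total_bits = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS
full = pack_bundle(0x1F, [(1 << SLOT_BITS) - 1] * 3)
example = pack_bundle(0x10, [1, 2, 3])
print(total_bits, full == (1 << 128) - 1, unpack_bundle(example))
# 128 True (16, [1, 2, 3])
```

The template values used here are arbitrary five-bit numbers; the sketch says nothing about which template encodes which combination of instruction classes and stops.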
I have copies of some of the documents from Intel's developers site: