Speculation

Why Use Speculation

Proper instruction scheduling reorders instructions to maximize the efficiency of a processor's resources.

A local scheduler divides a program into segments called basic blocks. A basic block is a contiguous sequence of instructions with a single entry point and a single exit point, so no branch occurs inside it. A local scheduler can reorder instructions only within the bounds of a single basic block.
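A common way to find basic blocks is to mark "leaders": the first instruction, every branch target, and every instruction that immediately follows a branch. Each basic block then runs from one leader up to, but not including, the next. The C sketch below assumes a hypothetical instruction representation (`struct insn`) that records only whether an instruction branches and to which index; it is an illustration of the idea, not how any particular compiler stores code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical instruction model: only branch information matters here. */
struct insn {
    bool is_branch;   /* true if the instruction may transfer control */
    int target;       /* index of the branch target, or -1 if none */
};

/* Mark the leader (first instruction) of every basic block and return the
 * number of basic blocks. leader must have room for n entries. */
int mark_leaders(const struct insn *code, int n, bool *leader)
{
    assert(n > 0);
    for (int i = 0; i < n; i++)
        leader[i] = false;
    leader[0] = true;                       /* program entry */
    for (int i = 0; i < n; i++) {
        if (!code[i].is_branch)
            continue;
        if (code[i].target >= 0 && code[i].target < n)
            leader[code[i].target] = true;  /* a branch target starts a block */
        if (i + 1 < n)
            leader[i + 1] = true;           /* so does the fall-through path */
    }
    int blocks = 0;
    for (int i = 0; i < n; i++)
        if (leader[i])
            blocks++;
    return blocks;
}
```

For a six-instruction program with a branch at index 2 (to index 4) and a branch at index 5 (back to index 0), the leaders are 0, 3 and 4, giving three basic blocks.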

In practice, a branch typically occurs about once every eight instructions, so basic blocks are small. Combined with the data dependencies inside each block, this leaves a local scheduler little to reorder, and the resulting schedules typically perform poorly.

Additionally, latencies to secondary caches, tertiary caches and main memory, measured in clock cycles, keep increasing.

Loads are especially critical because they typically start a chain of computation, so any delay in issuing a load delays every instruction that depends on it:

Chain of Computation

  1. Load values
  2. Perform computation
  3. Store results

Control Speculation

IA-64 provides mechanisms for scheduling loads ahead of the branches and stores that precede them in program order. Moving a load above a branch is called control speculation.

NaT

If a speculative load (ld.s) executes successfully, the value is loaded into the target register exactly as with a normal load. If the load would raise an exception, the exception is deferred instead: rather than trapping, the load marks the target register with a deferred-exception token appropriate to the register type.

Floating-point registers represent this token with a special encoding called NaTVal, a value outside the IEEE formats that IA-64 interprets as a deferred exception. Integer registers have no spare encoding, so IA-64 gives each general register a 65th bit, the NaT (Not a Thing) bit, which signals the deferred exception.
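As a minimal sketch, the C code below models a general register as 64 data bits plus a NaT bit, and a speculative 8-byte load that defers its exception by setting NaT instead of trapping. The names `struct gr`, `ld8_s` and `would_fault` are hypothetical, and a NULL pointer stands in for an address that would fault; real hardware decides this through address translation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of an IA-64 general register: 64 data bits plus the NaT bit. */
struct gr {
    uint64_t value;
    bool nat;
};

/* Hypothetical fault model: NULL plays the role of a faulting address. */
static bool would_fault(const uint64_t *addr)
{
    return addr == NULL;
}

/* Model of ld8.s: a normal load on success; on a would-be fault the
 * exception is deferred by setting NaT (data bits left as 0 here). */
struct gr ld8_s(const uint64_t *addr)
{
    struct gr r = { 0, false };
    if (would_fault(addr)) {
        r.nat = true;          /* deferred-exception token */
        return r;
    }
    r.value = *addr;
    return r;
}
```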

Arithmetic instructions propagate NaT tokens: the NaT bit of the destination is the logical OR of the NaT bits of the source operands.
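The OR rule can be sketched for an add using a hypothetical register model (`struct gr`, 64 data bits plus a NaT flag). Zeroing the data bits when NaT is set is a choice made in this sketch, not a statement about what the hardware leaves in the register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct gr {
    uint64_t value;
    bool nat;
};

/* Model of add r1 = r2, r3: the result is NaT if either source is NaT. */
struct gr gr_add(struct gr a, struct gr b)
{
    struct gr r;
    r.nat = a.nat || b.nat;
    /* Data bits carry no meaning under NaT; this sketch zeroes them. */
    r.value = r.nat ? 0 : a.value + b.value;
    return r;
}
```

Because every consumer ORs in its sources' NaT bits, a single failed speculative load taints the entire chain of computation built on it, so one check at the end of the chain is enough.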

Only instructions whose destination is an integer or floating-point register propagate the NaT token. Compare instructions are the exception: their targets are predicate registers, which have no NaT bit, so a compare that reads a NaT source sets its predicate targets to false.
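A sketch of that compare behavior, modeling a plain `cmp.eq p1, p2 = r2, r3`: normally the first predicate receives the result and the second its complement, but a NaT source clears both. The types and the function name are hypothetical, and other compare variants are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct gr {
    uint64_t value;
    bool nat;
};

struct pred_pair {
    bool p1;   /* first predicate target: the comparison result */
    bool p2;   /* second predicate target: its complement */
};

/* Model of cmp.eq p1, p2 = a, b with NaT handling. */
struct pred_pair cmp_eq(struct gr a, struct gr b)
{
    struct pred_pair p;
    if (a.nat || b.nat) {
        /* Predicates have no NaT bit: both targets are set to false. */
        p.p1 = false;
        p.p2 = false;
        return p;
    }
    p.p1 = (a.value == b.value);
    p.p2 = !p.p1;
    return p;
}
```

Since both predicates become false, neither side of a branch or predicated region guarded by them executes on the basis of a NaT value.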

When a compiler moves a load above a branch, out of the basic block to which it belongs, there are two cases:

  1. If the compiler can prove that the load can execute safely, it simply moves an ordinary load.
  2. Otherwise, it issues a speculative load (ld.s) early and places a speculation check instruction (chk.s) at the original location of the load.

IA-64 explicitly provides a speculation check instruction, chk.s. It takes two operands: a register and the address of recovery code. When the check executes, it tests the NaT bit of the register. If the bit is clear, the check does nothing. Otherwise, execution branches to the location given by the second operand. In the common case the check costs nothing: it does not consume a clock cycle, and subsequent instructions that use the register can execute in the same cycle as the check.

The second operand is the starting location of the recovery code. The compiler is responsible for generating this code: it must re-execute the load non-speculatively and re-execute any instructions that consumed the speculatively loaded value, restoring the program to a correct state.

The following example from [1] shows the effect of control speculation:

Original Order            Reordered with Speculation
(p1) br.cond label        ld8.s r1 = [r9]
ld8 r1 = [r9]             ld8.s r2 = [r8]
ld8 r2 = [r8]             add r3 = r1, r2
add r3 = r1, r2           (p1) br.cond label
                          chk.s r3, recover
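The reordered sequence can be simulated in C to show why speculation is safe: when the branch is taken, a faulting speculative load costs nothing, and when it falls through, chk.s on r3 catches a NaT that propagated from either load. This is a sketch under stated assumptions: the names are hypothetical, NULL stands in for a faulting address, and the recovery path is modeled by redoing the loads and the add non-speculatively, reporting `FAULTED` where real hardware would raise the now-genuine exception.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct gr { uint64_t value; bool nat; };

enum outcome { TAKEN_BRANCH, FELL_THROUGH, FAULTED };

/* ld8.s model: NULL (a would-be fault) yields NaT instead of a trap. */
static struct gr ld8_s(const uint64_t *addr)
{
    struct gr r = { 0, addr == NULL };
    if (!r.nat)
        r.value = *addr;
    return r;
}

/* add model: NaT is the OR of the sources' NaT bits. */
static struct gr gr_add(struct gr a, struct gr b)
{
    struct gr r = { a.value + b.value, a.nat || b.nat };
    return r;
}

/* Runs the "Reordered with Speculation" column; p1 is the branch predicate. */
enum outcome run_reordered(bool p1, const uint64_t *r9, const uint64_t *r8,
                           uint64_t *r3_out)
{
    struct gr r1 = ld8_s(r9);          /* ld8.s r1 = [r9]  (cannot trap) */
    struct gr r2 = ld8_s(r8);          /* ld8.s r2 = [r8]  (cannot trap) */
    struct gr r3 = gr_add(r1, r2);     /* add r3 = r1, r2  (NaT propagates) */

    if (p1)                            /* (p1) br.cond label */
        return TAKEN_BRANCH;           /* speculative results are discarded */

    if (r3.nat) {                      /* chk.s r3, recover */
        /* recover: redo the loads and the add non-speculatively. */
        if (r9 == NULL || r8 == NULL)
            return FAULTED;            /* the exception is now genuine */
        r3.value = *r9 + *r8;
        r3.nat = false;
    }
    *r3_out = r3.value;
    return FELL_THROUGH;
}
```

The second test case below is the important one: the hoisted load from a bad address executes, but because the branch is taken its deferred exception is never raised.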

Data Speculation

The compiler may also need to move a load ahead of a store that might write the same memory location. Sometimes the compiler can prove that no conflict exists, but with pointers the dependency often remains ambiguous.

The following is an example from [1]:

unsigned char flag;

int test(int *a, int *b)
{
    if (*a)
        flag += 1;
    return *b - 1;
}

It is unlikely that b points to the global variable flag. However, the compiler must honor the potential dependency and schedule the load of *b after the store of the incremented flag. IA-64 provides an advanced load instruction (ld.a) that breaks this ambiguous dependency, allowing the load of *b to be scheduled before the store of flag. The following is the compiled code:

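// ra, rb, rf, radd_a, radd_b and radd_f are symbolic register names;
// each { } group holds instructions that can issue in the same cycle.
{
    ld1        rf = [radd_f]     // ld flag
    ld4        ra = [radd_a]     // ld *a (int is 4 bytes)
    ld4.a      rb = [radd_b]     // advanced load of *b
} {
    cmp.ne     p1, p0 = ra, r0   // p1 = (*a != 0)
} {
    (p1) adds  rf = 1, rf        // flag += 1
} {
    (p1) st1   [radd_f] = rf     // st flag
    ld4.c.clr  rb = [radd_b]     // check *b; reload if the store hit it
} {
    adds       ret0 = -1, rb     // *b - 1
    br.ret.sptk.many b0          // return
}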

ALAT

The ALAT (Advanced Load Address Table) detects collisions between advanced loads and stores. Each ALAT entry has the following fields:

   Register #      Address      Size  

Each entry records the target register of the advanced load, a portion of the physical address being loaded, and the size of the load in bytes.

When an advanced load executes, it allocates an entry for its target register, replacing any existing entry for the same register.

Every store passes its physical address to the ALAT, which removes any entries whose addresses overlap the bytes being stored.

Finally, when the check is performed, the ALAT is searched for an entry matching the target register. If one exists, no intervening store wrote the loaded location, so the speculatively loaded value is still valid. If there is no entry, a store may have collided with the load (an entry can also be lost for other reasons, such as the table being cleared). A check load (ld.c) then reloads the value into the target register, and any instructions that depend on it must wait until the reload completes; chk.a instead branches to recovery code.
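The table operations just described can be sketched in C. This is a simplified software model, not the hardware design: the capacity (`ALAT_ENTRIES`), the eviction policy when the table is full, the full-address storage and all function names are assumptions made for illustration. Checks match on the register only, as described above, and the sketch also includes a `clear` option for the .clr completer and a whole-table invalidation of the kind used on a context switch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ALAT_ENTRIES 32   /* hypothetical capacity for this sketch */

struct alat_entry {
    bool valid;
    int reg;              /* target register of the advanced load */
    uint64_t addr;        /* physical address (hardware keeps only part) */
    unsigned size;        /* access size in bytes */
};

static struct alat_entry alat[ALAT_ENTRIES];

/* Do byte ranges [a, a+as) and [b, b+bs) overlap? */
static bool overlaps(uint64_t a, unsigned as, uint64_t b, unsigned bs)
{
    return a < b + bs && b < a + as;
}

/* ld.a: allocate an entry for reg, replacing any existing entry for reg. */
void alat_advanced_load(int reg, uint64_t addr, unsigned size)
{
    assert(size > 0);
    int slot = -1;
    for (int i = 0; i < ALAT_ENTRIES; i++) {
        if (alat[i].valid && alat[i].reg == reg) { slot = i; break; }
        if (!alat[i].valid && slot < 0)
            slot = i;                 /* remember the first free slot */
    }
    if (slot < 0)
        slot = 0;                     /* full: evict (only forces a reload later) */
    alat[slot] = (struct alat_entry){ true, reg, addr, size };
}

/* Every store: remove entries whose range overlaps the stored bytes. */
void alat_store(uint64_t addr, unsigned size)
{
    for (int i = 0; i < ALAT_ENTRIES; i++)
        if (alat[i].valid && overlaps(alat[i].addr, alat[i].size, addr, size))
            alat[i].valid = false;
}

/* ld.c / chk.a: true if reg's entry survived. clear models .clr. */
bool alat_check(int reg, bool clear)
{
    for (int i = 0; i < ALAT_ENTRIES; i++) {
        if (alat[i].valid && alat[i].reg == reg) {
            if (clear)
                alat[i].valid = false;
            return true;
        }
    }
    return false;  /* ld.c reloads; chk.a branches to recovery */
}

/* Context switch: invalidate everything, no state is saved. */
void alat_invalidate_all(void)
{
    for (int i = 0; i < ALAT_ENTRIES; i++)
        alat[i].valid = false;
}
```

In the data speculation example, `ld4.a rb` corresponds to `alat_advanced_load`, the `st1` to flag to `alat_store`, and `ld4.c.clr rb` to `alat_check(rb, true)`, which fails only if the store overlapped *b.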

The ALAT also requires no state to be saved on a context switch: it is simply cleared. When the process resumes, any outstanding advanced loads fail their checks and take the reload or recovery path, but this penalty is considered much smaller than the cold-cache effects of the context switch.

The compiler can also indicate when an ALAT entry is no longer needed by using check instructions with the .clr completer, such as ld.c.clr and chk.a.clr.