Register Rotation

IA-64 provides register renaming that makes the registers appear to rotate. Register rotation is available for the general registers, the floating-point registers, and the predicate registers. Three additional registers are also provided: LC (Loop Count), EC (Epilog Count), and RRB (Register Rotation Base).

Register rotation is used to optimize loops that are either counted or data-terminated. Counted loops are loops whose iteration count is known before the loop is entered, while data-terminated loops depend on values calculated inside the loop.

The general, floating-point, and predicate registers are each divided into a static subset and a rotating subset, as follows:

 Register Set  Static  Rotating
 General Registers (GR)  0-31  32-127
 Floating Point Registers (FR)  0-31  32-127
 Predicate Registers (PR)  0-15  16-63

The RRB register is used to rename accesses to the rotating registers. A reference to any register in the rotating range is offset by the value of the RRB. Thus, if the RRB has a current value of 10, a reference to GR[48] actually refers to GR[58]. The offset register number wraps around within the rotating range through modulo arithmetic, so the register contents appear to rotate.
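The renaming rule can be sketched in a few lines of Python. This is only an illustrative model of the description above, not the hardware's exact logic: the `ROTATING` table and the `rename` function are hypothetical names, and the model assumes a single RRB value applied as a simple additive offset within each rotating range from the table above.

```python
# Illustrative model only: (first rotating register, size of rotating range),
# taken from the subdivision table in the text.
ROTATING = {
    "GR": (32, 96),  # general registers 32-127
    "FR": (32, 96),  # floating-point registers 32-127
    "PR": (16, 48),  # predicate registers 16-63
}


def rename(reg_set, reg, rrb):
    """Map a register number named by an instruction to the register accessed."""
    first, size = ROTATING[reg_set]
    if reg < first:
        return reg  # static registers are never renamed
    # Offset by RRB and wrap around inside the rotating range.
    return first + (reg - first + rrb) % size


print(rename("GR", 48, 10))   # 58, the example from the text
print(rename("GR", 120, 10))  # 34, wraps past GR[127] back to the start of the range
print(rename("GR", 5, 10))    # 5, static register is unchanged
```

Changing the RRB value changes which register every rotating reference reaches, without any data being moved.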

Modulo-Scheduling

Modulo-scheduling overlaps multiple iterations of a loop, using multiple rotating registers to represent a single variable within the loop. Most compilers currently support this type of optimization, but it is generally only efficient for loops with large iteration counts. IA-64 reduces the overhead of modulo-scheduling, making it efficient for loops with small iteration counts as well.

The prologue of a loop fills the software pipeline, the kernel of the loop executes the loop logic, and the epilogue drains the software pipeline. The following code example, extracted from [1], provides insight into the process:

Original code:

for (i=0; i<n; i++) {
     *b++ = *a++;
}  /* copy string */


Compiled Code:

// setup ra, rb, LC, check n>0

prologue
{
     ld8 r33 = [ra], 8
}
kernel
.label loop
{
     ld8 r32 = [ra], 8
     st8 [rb] = r33, 8
     br.ctop #loop
}
epilogue
{
     st8 [rb] = r33, 8
}

Register rotation provides renaming of intermediate results from previous iterations, eliminating register copy operations in the loop body.
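The prologue, kernel, and epilogue structure of the compiled code can be mimicked in Python to see how the load of one iteration overlaps the store of the previous one. This is a rough sketch under stated assumptions: `pipelined_copy` is a hypothetical name, a Python list stands in for memory, and rotation is modeled by simply renaming the value held in `r32` to `r33` at the end of each kernel pass.

```python
def pipelined_copy(src):
    """Copy src element by element using a two-stage software pipeline."""
    dst = []
    if not src:
        return dst  # mirrors the "check n>0" guard in the setup code

    # prologue: first load goes straight into r33 (no rotation happens here)
    r33 = src[0]

    # kernel: load element i while storing element i-1
    for i in range(1, len(src)):
        r32 = src[i]     # ld8 r32 = [ra], 8
        dst.append(r33)  # st8 [rb] = r33, 8
        r33 = r32        # br.ctop: rotation renames r32 to r33, no copy instruction

    # epilogue: store the last loaded value
    dst.append(r33)
    return dst


print(pipelined_copy([10, 20, 30]))  # [10, 20, 30]
```

In this model the assignment `r33 = r32` is an explicit copy; on IA-64 the same effect comes from renaming, which is why no copy instruction appears in the loop body.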

Epilogue Count

The EC register holds the number of additional kernel passes required to drain the software pipeline, while the LC register holds the number of remaining loop iterations. On each br.ctop, if LC > 0, the LC register is decremented, a 1 is written to predicate register p16, and the kernel is repeated. If LC == 0, a zero is written to p16 and the EC register is examined. If EC > 1, the EC register is decremented and the kernel is repeated. If EC == 1, the EC register is decremented to zero and the branch falls through, so the loop terminates with LC == 0 and EC == 0.

This scheme allows the prologue, kernel, and epilogue to be flattened into a single loop body, with predicates determining which instructions execute on each pass. The following example rewrites the code above to use the epilogue count register:

// setup ra, rb, LC=n-1, EC=2, p16=1, p17=0
.label loop
{
     (p16) ld8  r32 = [ra], 8
     (p17) st8  [rb] = r33, 8
     br.ctop  #loop
}

The following table summarizes the values on execution:

 Cycle            Executes under p16  Executes under p17  Branch   p16  p17  LC   EC
 1                ld1                 -                   br.ctop  1    0    n-1  2
 2                ld2                 st1                 br.ctop  1    1    n-2  2
 3                ld3                 st2                 br.ctop  1    1    n-3  2
 ...              ...                 ...                 ...      ...  ...  ...  ...
 n                ldn                 st(n-1)             br.ctop  1    1    0    2
 n+1              -                   stn                 br.ctop  0    1    0    1
 after loop exit  -                   -                   -        0    0    0    0

After the first cycle executes, the value of p16 rotates into p17, which predicates the store on, and the branch instruction writes a 1 into p16. In cycle n, the branch instruction finds LC == 0 and writes a 0 into p16, so the load is predicated off in cycle n+1 while the store still executes to drain the pipeline.
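The table can be reproduced with a small simulation of the flattened loop. This is a sketch rather than a faithful model of the hardware: `br_ctop` and `flattened_copy` are hypothetical names, rotation is modeled by shifting p16 into p17 and r32 into r33, and the branch is assumed to fall through when EC is decremented from 1 to 0, which is what the last two rows of the table show.

```python
def br_ctop(lc, ec):
    """One br.ctop decision: returns (taken, lc, ec, new_p16, rotate)."""
    if lc > 0:
        return True, lc - 1, ec, 1, True   # more iterations to start
    if ec > 1:
        return True, lc, ec - 1, 0, True   # draining the pipeline
    if ec == 1:
        return False, lc, 0, 0, True       # last drain pass, fall through
    return False, lc, ec, 0, False         # EC already zero: no rotation


def flattened_copy(src):
    """Run the predicated loop; return (dst, trace of (p16, p17, LC, EC))."""
    dst, trace = [], []
    if not src:
        return dst, trace
    lc, ec = len(src) - 1, 2  # setup: LC=n-1, EC=2
    p16, p17 = 1, 0           # setup: p16=1, p17=0
    r32 = r33 = None
    a = 0                     # index standing in for the pointer ra
    while True:
        trace.append((p16, p17, lc, ec))
        if p16:               # (p16) ld8 r32 = [ra], 8
            r32 = src[a]
            a += 1
        if p17:               # (p17) st8 [rb] = r33, 8
            dst.append(r33)
        taken, lc, ec, new_p16, rotate = br_ctop(lc, ec)
        if rotate:            # rotation: r32 -> r33, p16 -> p17
            r33, p17, p16 = r32, p16, new_p16
        if not taken:
            break
    trace.append((p16, p17, lc, ec))  # state after loop exit
    return dst, trace


dst, trace = flattened_copy([10, 20, 30])
print(dst)  # [10, 20, 30]
for row in trace:
    print(row)
# (1, 0, 2, 2)
# (1, 1, 1, 2)
# (1, 1, 0, 2)
# (0, 1, 0, 1)
# (0, 0, 0, 0)
```

With n = 3 the printed rows correspond to cycles 1, 2, n, n+1, and the state after loop exit in the table.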

While Loops

Modulo-scheduling is also supported for data-dependent loops, such as those produced by a while statement. The loop condition is computed into a predicate register in the body of the loop, rather than being derived from the LC register. The br.wtop branch instruction branches back to the top of the loop as long as that predicate is true. Once the predicate is false, br.wtop keeps looping but decrements the EC register on each pass. The loop stops only when the predicate is false and the EC register has been counted down to zero.
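The exit rule for br.wtop can be sketched the same way. The model below covers only the branch decision, not register rotation or predicate writes, and its names (`br_wtop`, `kernel_passes`) are hypothetical. It assumes, by analogy with br.ctop, that the branch falls through on the pass that decrements EC from 1 to 0.

```python
from itertools import chain, repeat


def br_wtop(qp, ec):
    """One br.wtop decision: returns (taken, ec)."""
    if qp:
        return True, ec          # loop condition still true: keep looping
    if ec > 1:
        return True, ec - 1      # condition false: drain the pipeline
    return False, max(ec - 1, 0)  # final drain pass: fall through


def kernel_passes(conditions, ec):
    """Count kernel executions for a sequence of loop-condition predicates.

    Once `conditions` is exhausted, the predicate is treated as false.
    """
    passes = 0
    for qp in chain(conditions, repeat(False)):
        passes += 1
        taken, ec = br_wtop(qp, ec)
        if not taken:
            return passes


print(kernel_passes([True, True, True], 2))  # 5: three live passes plus two drain passes
```

The extra passes after the condition turns false are the epilogue stages, executed inside the same loop body under predicate control.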