IA-64 provides register renaming, which makes the registers appear to rotate. Register rotation is available for the general registers, the floating-point registers and the predicate registers. Three additional registers, the LC (Loop Count), EC (Epilog Count) and RRB (Register Rotation Base), are provided to support it.
Register rotation is used for optimizing loops that are either counted or data-terminated. Counted loops are loops whose number of iterations is known before the loop is entered, while data-terminated loops depend on values calculated inside the loop.
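The general, floating-point and predicate registers are each divided into a static subset and a rotating subset. The subdivision is as follows: general registers GR0 to GR31 are static and GR32 to GR127 can rotate (the size of the rotating region is programmable); floating-point registers FR0 to FR31 are static and FR32 to FR127 rotate; predicate registers p0 to p15 are static and p16 to p63 rotate.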
The RRB register is used to rename accesses to the rotating subset of registers. A reference to any register in the rotating range is offset by the value of the RRB. For example, if the RRB has a current value of 10, a reference to GR32 would actually refer to GR42. The offset register number 'wraps' within the rotating range through the use of modulo arithmetic, so a reference that would run past the last rotating register continues from the first one. Changing the RRB therefore makes the register values appear to rotate.
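The renaming rule can be sketched in a few lines of Python. This is only an illustration of the modulo arithmetic, not a description of the hardware: the function name `physical_register` and its parameters are hypothetical, and the rotating region is assumed to start at register 32 and to span 96 registers (the largest general-register case).

```python
def physical_register(named, rrb, first_rotating=32, num_rotating=96):
    """Map the register number named in an instruction to the register accessed.

    Assumption for this sketch: the rotating region covers the registers
    first_rotating .. first_rotating + num_rotating - 1.
    """
    if named < first_rotating or named >= first_rotating + num_rotating:
        return named  # static registers are never renamed
    # Offset by the RRB and wrap within the rotating region.
    offset = (named - first_rotating + rrb) % num_rotating
    return first_rotating + offset


print(physical_register(32, rrb=10))   # 42
print(physical_register(120, rrb=10))  # 34, wrapped past the end of the region
print(physical_register(5, rrb=10))    # 5, static register
```

Because the same instruction names the same register on every pass, changing only the RRB between passes is enough to make each pass work on a different register.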
Modulo-scheduling overlaps multiple iterations of a loop, using multiple rotating registers to represent a single variable within the loop. Most compilers currently allow for this type of optimization, but it is generally only efficient for loops with a large iteration count. IA-64 reduces the overhead of modulo-scheduling, making it efficient for loops with a small iteration count as well.
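The overlap can be visualised with a small sketch that lists which iteration occupies which pipeline stage in each cycle. It assumes that every stage takes exactly one cycle and that a new iteration starts every cycle; the function name `modulo_schedule` is hypothetical and chosen only for this illustration.

```python
def modulo_schedule(iterations, stages):
    """Return one row per cycle; row[s] is the iteration in stage s, or None if idle."""
    rows = []
    for cycle in range(iterations + stages - 1):
        row = []
        for stage in range(stages):
            it = cycle - stage  # iteration that entered the pipeline `stage` cycles ago
            row.append(it if 0 <= it < iterations else None)
        rows.append(row)
    return rows


for cycle, row in enumerate(modulo_schedule(iterations=4, stages=3)):
    print(cycle, row)
# 0 [0, None, None]   pipeline filling
# 1 [1, 0, None]
# 2 [2, 1, 0]         all stages busy
# 3 [3, 2, 1]
# 4 [None, 3, 2]      pipeline draining
# 5 [None, None, 3]
```

Four iterations of a three-stage loop finish in six cycles instead of twelve. The idle slots at the start and end are the fixed cost that makes the technique less attractive for short loops when it has to be implemented with extra code.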
The prologue of a loop fills the software pipeline, the kernel of the loop executes the loop logic, and the epilogue drains the software pipeline. The following code example extracted from  provides insight to the process:
The EC register counts the passes through the kernel that are required to drain the software pipeline, and the LC register counts the loop iterations that remain to be started. The loop branch examines both each time it executes. If LC > 0, the LC register is decremented, a 1 is written in predicate register p16, and the kernel is repeated. If LC == 0, a zero is written to p16 and the EC register is examined. If EC > 1, the EC register is decremented and the kernel of the loop is repeated. If EC == 1, the EC register is decremented to zero and the branch falls through, so the loop terminates with LC == 0 and EC == 0.
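The decision made by the loop branch can be modelled as a small function. This is a minimal sketch of the counted-loop branch (`br.ctop` in IA-64 assembly) as understood here, in which the final pass decrements EC from 1 to 0 and falls through; the function names `br_ctop` and `kernel_passes` are hypothetical, and register rotation itself is left out.

```python
def br_ctop(lc, ec):
    """Model one execution of the counted-loop branch.

    Returns (new_lc, new_ec, value_written_to_p16, branch_taken).
    """
    if lc > 0:
        return lc - 1, ec, 1, True      # more iterations to start
    if ec > 1:
        return lc, ec - 1, 0, True      # draining: keep looping
    return lc, max(ec - 1, 0), 0, False  # last pass: fall through


def kernel_passes(lc, ec):
    """Count how many times the kernel runs before the branch falls through."""
    passes = 0
    taken = True
    while taken:
        passes += 1
        lc, ec, _, taken = br_ctop(lc, ec)
    return passes, lc, ec


# Assumed setup: 4 iterations, 2 pipeline stages, so LC = 4 - 1 and EC = 2.
print(kernel_passes(lc=3, ec=2))  # (5, 0, 0)
```

With LC set to the iteration count minus one and EC set to the number of stages, the kernel runs once per iteration plus once per additional stage, which is exactly the number of passes needed to fill and drain the pipeline.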
This format allows the loop to be flattened, and the code executed in the epilogue is determined by its predicate. The following is an example of the code above utilizing an epilogue count register.
The following table summarizes the values on execution:
After the first cycle executes, the predicate value of p16 rotates into p17, causing the store to be predicated on, and the branch instruction writes a 1 into p16. When cycle n is reached, the branch instruction causes the load instruction to be predicated off.
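The movement of the predicates can be traced with a short simulation. It assumes a two-stage pipeline in which the load is guarded by p16 and the store by p17, with LC initialised to n - 1, EC initialised to the number of stages, and p16 preset to 1 before the loop; these starting values and the name `trace_predicates` are assumptions made for the sketch.

```python
def trace_predicates(n, stages=2):
    """Return one (predicates, lc, ec) tuple per kernel pass.

    predicates[0] models p16 (load enabled), predicates[1] models p17
    (store enabled), and so on for deeper pipelines.
    """
    lc, ec = n - 1, stages
    preds = [1] + [0] * (stages - 1)  # only the first stage is enabled at entry
    rows = []
    taken = True
    while taken:
        rows.append((tuple(preds), lc, ec))
        # Loop branch: decide what enters p16 and whether to repeat.
        if lc > 0:
            lc, new_p16 = lc - 1, 1
        elif ec > 1:
            ec, new_p16 = ec - 1, 0
        else:
            ec, new_p16, taken = max(ec - 1, 0), 0, False
        # Rotation: every predicate moves up one place, p16 gets the new value.
        preds = [new_p16] + preds[:-1]
    return rows


for predicates, lc, ec in trace_predicates(3):
    print(predicates, "LC =", lc, "EC =", ec)
# (1, 0) LC = 2 EC = 2   load only
# (1, 1) LC = 1 EC = 2   load and store
# (1, 1) LC = 0 EC = 2   load and store
# (0, 1) LC = 0 EC = 1   store only
```

Three loads and three stores are issued in four passes, and the same kernel code serves as prologue, kernel and epilogue purely through its predicates.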
Modulo-scheduling is also provided for data-dependent loops, such as a while loop. The loop condition is computed into a predicate register in the body of the loop, rather than derived from the LC register. The br.wtop instruction branches back to the top of the loop as long as that predicate is true. Once the predicate is false, br.wtop continues to loop but decrements the EC register on each pass. The loop stops only when the predicate is false and the EC register has been counted down to zero.
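The data-terminated case can be sketched in the same way. The model below assumes that the loop condition stays false once it has become false, and that EC is set to the number of passes needed to drain the pipeline; the names `br_wtop` and `run_while_loop` and the example data are hypothetical.

```python
def br_wtop(condition, ec):
    """Model one execution of the while-loop branch. Returns (new_ec, branch_taken)."""
    if condition:
        return ec, True              # condition still holds: loop, EC untouched
    if ec > 1:
        return ec - 1, True          # draining: keep looping
    return max(ec - 1, 0), False     # last pass: fall through


def run_while_loop(values, ec):
    """Process values until a zero is found, then drain. Returns (passes, final_ec)."""
    passes = 0
    index = 0
    draining = False
    while True:
        passes += 1
        # Condition computed inside the loop body, not taken from LC.
        condition = (not draining) and index < len(values) and values[index] != 0
        index += 1
        if not condition:
            draining = True
        ec, taken = br_wtop(condition, ec)
        if not taken:
            return passes, ec


print(run_while_loop([5, 7, 0], ec=2))  # (4, 0)
```

The two non-zero values keep the loop running, the zero makes the predicate false, and the remaining passes are counted off by EC alone.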