Cache Coherence

Overview

Caches enhance the performance of multiprocessors by reducing network traffic and average memory access latency. In parallel computing, the cache coherence problem arises because multiple processors may read and modify the same memory blocks within their own caches. Common solutions to the problem are coherence through bus snooping and directory-based coherence. The nodes in the Alewife machine transmit messages in a point-to-point manner through a mesh interconnection network to reduce network congestion. Because of this, not every node sees every message that is exchanged, which makes bus snooping impossible.

One directory-based solution is a full map directory protocol. In this method, each block of memory has an associated directory entry which contains a bit for each cache in the system. That bit indicates whether or not the associated cache contains a copy of the memory block. Research has shown that this method is very fast and accurate. However, it is prohibitively expensive to implement in a general purpose multiprocessor, and because every entry grows with the number of caches, implementing the directories in hardware greatly limits the scalability of the system. To quote a member of the Alewife team, "by committing overwhelming resources to cache coherence, it is always possible to achieve good performance".

Another directory-based solution is to use a limited directory protocol. This method takes advantage of the observation that most parallel algorithms tend to avoid widespread sharing of variables. Like the full map solution, each memory block has an associated directory entry. Instead of having a bit for every node in the system, the entry contains a small, fixed number of pointers, each identifying a cache that has a copy of the memory block. When all the pointers in the entry have been used, older entries must be invalidated to allow new caches to obtain copies of the data.
One advantage of this method is that its performance is comparable to that of a full map scheme in cases where there is limited sharing of data between processors. It is also much cheaper to implement, and much more scalable, since the number of pointers in an entry does not depend upon the number of processors in the system. A disadvantage is that the protocol is susceptible to thrashing when the number of processors sharing data exceeds the number of pointers in the directory entry.
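The thrashing behaviour of a limited directory can be illustrated with a minimal sketch. The class name `LimitedDirectoryEntry`, the helper `count_invalidations`, and the oldest-first eviction order are assumptions made for illustration; the text only says that older entries are invalidated when the pointers run out.

```python
from collections import OrderedDict


class LimitedDirectoryEntry:
    """Hypothetical directory entry with a small, fixed number of pointers."""

    def __init__(self, max_pointers):
        self.max_pointers = max_pointers
        # Insertion-ordered, so the oldest pointer can be evicted first
        # (oldest-first eviction is an assumption of this sketch).
        self.sharers = OrderedDict()

    def add_sharer(self, cache_id):
        """Record a reader and return the cache ids that must be invalidated."""
        if cache_id in self.sharers:
            return []
        invalidated = []
        if len(self.sharers) >= self.max_pointers:
            oldest, _ = self.sharers.popitem(last=False)
            invalidated.append(oldest)
        self.sharers[cache_id] = True
        return invalidated


def count_invalidations(entry, accesses):
    """Total invalidations caused by a sequence of read requests."""
    return sum(len(entry.add_sharer(cache_id)) for cache_id in accesses)


# Two pointers, two sharers: the entry never overflows.
no_thrash = count_invalidations(LimitedDirectoryEntry(2), [0, 1, 0, 1, 0, 1])

# Two pointers, three sharers: each new reader evicts another sharer.
thrash = count_invalidations(LimitedDirectoryEntry(2), [0, 1, 2, 0, 1, 2])
```

With two pointers, two sharers cause no invalidations, while three sharers taking turns cause an invalidation on almost every read. A full map entry corresponds to setting `max_pointers` to the number of caches, in which case no reader is ever evicted.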
Both full map and limited directory schemes have their pros and cons. The Alewife machine attempts to take advantage of the strengths of both methods with a cache coherence scheme called LimitLESS Directories.

LimitLESS Directories

The LimitLESS scheme attempts to combine the full map and limited directory ideas in order to achieve a robust yet affordable and scalable cache coherence solution. The main idea behind this method is to handle the common case in hardware and the exceptional case in software. But what exactly does this mean? LimitLESS stands for Limited directory Locally Extended through Software Support. Alewife uses limited directories implemented in hardware to keep track of a fixed number of cached copies of each memory block. When the capacity of a directory entry is exceeded, the directory interrupts the local processor and a full map directory is emulated in software.

Each Alewife node contains a directory which has one 64-bit-wide entry for every memory block in the local portion of the globally shared memory. The entry has five 9-bit pointers to identify remote nodes caching the block (2^9 = 512, the maximum number of nodes in an Alewife machine) and one additional bit to indicate whether the local node is caching the block. The remaining bits in the entry record the state of the block, including whether there are additional pointers that have been allocated through software. When invalidating the pointers in hardware, the protocol must also check whether there are software pointers which must be invalidated as well. To make this scheme worthwhile, it is assumed that normal usage will not require the software extensions to keep track of cached blocks. Even so, the fast trap design of the Sparcle processor ensures that cases where software help is needed will not lead to a prohibitive delay in system performance.
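The 64-bit directory entry described above can be sketched as a packed integer. The field widths (five 9-bit pointers, one local bit, 18 remaining bits) follow from the text; the bit positions, the names `pack_entry` and `unpack_entry`, and the treatment of the remaining bits as a single opaque `state` field are assumptions of this sketch, not the actual CMMU layout.

```python
POINTER_BITS = 9                            # 2**9 = 512 addressable nodes
NUM_POINTERS = 5                            # hardware pointers per entry
POINTER_MASK = (1 << POINTER_BITS) - 1
LOCAL_SHIFT = POINTER_BITS * NUM_POINTERS   # assumed position: bit 45
STATE_SHIFT = LOCAL_SHIFT + 1               # assumed position: bits 46..63
STATE_BITS = 64 - STATE_SHIFT               # 18 bits left for block state
STATE_MASK = (1 << STATE_BITS) - 1


def pack_entry(pointers, local, state):
    """Pack five node ids, the local bit, and a state field into 64 bits."""
    if len(pointers) != NUM_POINTERS:
        raise ValueError("expected exactly five pointer slots")
    if not 0 <= state <= STATE_MASK:
        raise ValueError("state does not fit in the remaining bits")
    word = 0
    for slot, node in enumerate(pointers):
        if not 0 <= node <= POINTER_MASK:
            raise ValueError("node id does not fit in 9 bits")
        word |= node << (slot * POINTER_BITS)
    if local:
        word |= 1 << LOCAL_SHIFT
    word |= state << STATE_SHIFT
    return word


def unpack_entry(word):
    """Inverse of pack_entry: return (pointers, local, state)."""
    pointers = [(word >> (slot * POINTER_BITS)) & POINTER_MASK
                for slot in range(NUM_POINTERS)]
    local = bool((word >> LOCAL_SHIFT) & 1)
    state = (word >> STATE_SHIFT) & STATE_MASK
    return pointers, local, state


word = pack_entry([3, 511, 0, 7, 256], local=True, state=5)
fields = unpack_entry(word)
```

The text does not say how unused pointer slots are distinguished from valid ones, so this sketch leaves that to whatever is encoded in the state bits.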
Handling a request in software is still considerably slower than handling it in hardware: a request that causes five invalidations can be handled in hardware in only 84 cycles, whereas a request requiring six invalidations, which must be handled in software, takes 707 cycles. The hardware mechanisms that are required to implement the software-extended protocols are as follows:
Memory and Cache States

As seen in the table below, memory blocks can be in one of four states. Each memory state has a corresponding state for the block in every cache that currently holds it.
Before any processor modifies a block in an Invalid or Read-Only cache state, it first requests permission from the CMMU that manages the data. The CMMU then sends invalidations to each of the cached copies and waits for each of the caches to invalidate its copy (change cache state to Invalid) and return an acknowledgment message. The memory block is in Write-Transaction state while waiting for acknowledgments. When all the acks are received, the state of the memory block becomes Read-Write and the cache that originated the transaction is sent a message that it has write permission (it is now in Read-Write state). In a sense, the cache now "owns" the block until another cache requests access to the data.

When one cache has a block in Read-Write state and another cache requests read privileges for that block, the CMMU sends an update request to the cache that owns the data. The block that is waiting for data is marked with the Read-Transaction state. When a cache receives an update request, it invalidates its copy of the data and replies to memory with an update message that contains the modified data so that the original read request can be satisfied. After the transaction is complete, the original owner's cached block is Invalid and the new reader's cached block is Read-Only.

The Cache Coherence Protocol

The following diagrams describe the LimitLESS cache coherence protocol. The state transition diagram specifies the states, the composition of the pointer set (P), and the transitions between the states. Each transition is labeled with a number that refers to the specification in the table below the diagram. The table contains the following information:
As an example, Transition 2 from the Read-Only to the Read-Write state is taken when cache i requests write permission (WREQ) and the pointer set is empty or contains just cache i (P = {} or P = {i}). In this case, the pointer set is modified to contain i (if necessary) and the CMMU issues a message containing the data of the block to be written (WDATA).

The extra notation on the Read-Only state (S: n > p) indicates that the state is handled in software when the size of the pointer set (n) exceeds the size of the limited directory entry (p). In this situation, the transitions with the shaded labels (1, 2, and 3) are executed by the interrupt handler on the processor that is local to the overflowing directory. When the protocol changes from a software-handled state to a hardware-handled state, the processor must modify the directory state so that the CMMU can resume responsibility for the protocol transitions.
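The write-permission flow and the software overflow condition can be sketched as a small state machine on the memory side. This is a simplified illustration, not the full LimitLESS protocol: only write requests arriving in the Read-Only state are modelled, and the class `DirectoryBlock`, its methods, and the message tags `"INV"` and `"ACK"` are hypothetical names. `WDATA` is taken from the Transition 2 example; using it for the write-permission message after the last acknowledgment is an assumption.

```python
from enum import Enum


class MemState(Enum):
    READ_ONLY = "Read-Only"
    READ_WRITE = "Read-Write"
    READ_TRANSACTION = "Read-Transaction"
    WRITE_TRANSACTION = "Write-Transaction"


class DirectoryBlock:
    """Hypothetical memory-side view of one block (illustrative only)."""

    def __init__(self, hw_pointers=5):
        self.state = MemState.READ_ONLY
        self.pointers = set()        # pointer set P
        self.hw_pointers = hw_pointers
        self.pending_acks = set()
        self.writer = None

    def handled_in_software(self):
        # S: n > p on the Read-Only state.
        return (self.state is MemState.READ_ONLY
                and len(self.pointers) > self.hw_pointers)

    def add_reader(self, i):
        self.pointers.add(i)

    def write_request(self, i):
        """WREQ from cache i; returns the messages the CMMU would send."""
        if self.state is not MemState.READ_ONLY:
            raise ValueError("this sketch only models WREQ in Read-Only")
        others = self.pointers - {i}
        self.pointers = {i}
        if not others:
            # Transition 2: P = {} or P = {i}.
            self.state = MemState.READ_WRITE
            return [("WDATA", i)]
        # Other caches hold copies: invalidate them and wait for acks.
        self.state = MemState.WRITE_TRANSACTION
        self.pending_acks = set(others)
        self.writer = i
        return [("INV", k) for k in sorted(others)]

    def ack(self, k):
        """Acknowledgment from cache k that its copy is now Invalid."""
        self.pending_acks.discard(k)
        if self.state is MemState.WRITE_TRANSACTION and not self.pending_acks:
            self.state = MemState.READ_WRITE
            return [("WDATA", self.writer)]
        return []


block = DirectoryBlock()
block.add_reader(1)
block.add_reader(2)
sent = block.write_request(1)    # cache 2 must be invalidated first
granted = block.ack(2)           # last ack arrives, cache 1 gets permission
```

In a real system the `handled_in_software` check would decide whether the CMMU or the local processor's interrupt handler runs the transition; here it only reports the condition.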