
Cache Coherence

Overview

Caches enhance the performance of multiprocessors by reducing network traffic and average memory access latency.  In parallel computing, the problem of cache coherence arises because multiple processors may be reading and modifying copies of the same memory block, each within its own cache.  Common solutions to the cache coherence problem are bus snooping and directory based coherence.  The nodes in the Alewife machine transmit messages in a point-to-point manner through a mesh interconnection network to reduce network congestion.  As a result, not every node sees every message that is exchanged, which makes bus snooping impossible.

One directory based solution is the full map directory protocol.  In this method, each block of memory has an associated directory entry which contains a bit for each cache in the system.  That bit indicates whether or not the associated cache contains a copy of the memory block.  Research has shown that this method is very fast and accurate.  However, it is prohibitively expensive to implement in a general purpose multiprocessor, because the size of each directory entry grows with the number of caches, and implementing the directories in hardware therefore greatly limits the scalability of the system.  To quote a member of the Alewife team, "by committing overwhelming resources to cache coherence, it is always possible to achieve good performance".
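As a rough illustration of the full map idea, the sketch below keeps one presence bit per cache for a single memory block and shows how the total number of presence bits grows with the machine size.  The class and function names (FullMapEntry, directory_bits) are made up for this example and do not come from Alewife.

```python
class FullMapEntry:
    """Directory entry for one memory block: one presence bit per cache."""

    def __init__(self, num_caches):
        self.num_caches = num_caches
        self.presence = 0  # bit i set => cache i holds a copy of the block

    def add_sharer(self, cache_id):
        self.presence |= 1 << cache_id

    def remove_sharer(self, cache_id):
        self.presence &= ~(1 << cache_id)

    def sharers(self):
        """List the caches whose presence bit is set."""
        return [i for i in range(self.num_caches) if (self.presence >> i) & 1]


def directory_bits(num_caches, num_blocks):
    """Presence bits needed for the whole directory: caches x blocks."""
    return num_caches * num_blocks


entry = FullMapEntry(num_caches=512)
entry.add_sharer(3)
entry.add_sharer(500)
print(entry.sharers())
print(directory_bits(512, 1024))
```

Because every entry needs as many bits as there are caches, doubling the number of nodes doubles the size of every entry, which is the scalability cost described above.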

Another directory based solution is to use a limited directory protocol.  This method takes advantage of the observation that most parallel algorithms tend to avoid widespread sharing of variables.  Like the full map solution, each memory block has an associated directory entry.  Instead of having a bit for every node in the system, the entry contains a small, fixed number of pointers, each identifying a cache that has a copy of the memory block.  When all the pointers in the entry have been used, the older pointers must be reclaimed, and the copies they refer to invalidated, to allow new caches to obtain copies of the data.  The advantage of this method is that its performance is comparable to that of a full map scheme in cases where there is limited sharing of data between processors.  It is also much cheaper to implement, and much more scalable, since the number of pointers in an entry does not depend upon the number of processors in the system.  A disadvantage is that the protocol is susceptible to thrashing when the number of processors sharing data exceeds the number of pointers in the directory entry.
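The pointer replacement described above, and the thrashing it can cause, can be sketched as follows.  This is a minimal model that assumes the oldest pointer is the one reclaimed; the name LimitedEntry and its interface are invented for illustration.

```python
from collections import deque


class LimitedEntry:
    """Directory entry with a small, fixed number of cache pointers."""

    def __init__(self, max_pointers):
        self.max_pointers = max_pointers
        self.pointers = deque()  # oldest pointer at the left

    def add_sharer(self, cache_id):
        """Record a new sharer; return the caches that must be invalidated."""
        if cache_id in self.pointers:
            return []
        invalidated = []
        if len(self.pointers) == self.max_pointers:
            # No free pointer: evict the oldest sharer (assumed policy).
            invalidated.append(self.pointers.popleft())
        self.pointers.append(cache_id)
        return invalidated


# Three caches share a block, but the entry has only two pointers.
entry = LimitedEntry(max_pointers=2)
for cache_id in [0, 1, 2, 0, 1, 2]:
    print(cache_id, entry.add_sharer(cache_id))
```

Once the third cache joins, every subsequent read evicts a copy that another sharer is about to need again, which is the thrashing case.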

Both full map and limited directory schemes have their pros and cons.  The Alewife machine attempts to take advantage of the strengths of both methods with a cache coherence scheme called LimitLESS Directories.


LimitLESS Directories

The LimitLESS scheme attempts to combine the full map and limited directory ideas in order to achieve a robust yet affordable and scalable cache coherence solution.  The main idea behind this method is to handle the common case in hardware and the exceptional case in software.  But what exactly does this mean?  LimitLESS stands for Limited directory Locally Extended through Software Support.  Alewife uses limited directories implemented in hardware to keep track of a small, fixed number of cached copies of each memory block.  When the capacity of a directory entry is exceeded, the directory interrupts the local processor, which emulates a full map directory in software.
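The split between the hardware common case and the software exceptional case can be sketched roughly as below.  This is a simplified model, not the Alewife implementation: the class name, the choice to keep only the overflow pointers in software, and the trap counter are all assumptions made for illustration.  The default of five hardware pointers matches the Alewife directory entry.

```python
class LimitLESSEntry:
    """Toy model: a few hardware pointers, extended by software on overflow."""

    def __init__(self, hw_pointers=5):
        self.hw_limit = hw_pointers
        self.hw = []      # pointers held in the hardware directory entry
        self.sw = set()   # overflow pointers kept by the software handler
        self.traps = 0    # times the local processor was interrupted

    def add_reader(self, cache_id):
        if cache_id in self.hw or cache_id in self.sw:
            return
        if len(self.hw) < self.hw_limit:
            self.hw.append(cache_id)   # common case: handled in hardware
        else:
            self.traps += 1            # exceptional case: interrupt the processor
            self.sw.add(cache_id)

    def invalidate_all(self):
        """Return every cache that must be invalidated, then clear the entry."""
        targets = list(self.hw) + sorted(self.sw)
        self.hw.clear()
        self.sw.clear()
        return targets


entry = LimitLESSEntry()
for cache_id in range(7):
    entry.add_reader(cache_id)
print(entry.hw, entry.sw, entry.traps)
```

With seven readers, the first five are recorded without software involvement and only the last two cause a trap.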

Each Alewife node contains a directory which has one 64-bit wide entry for every memory block in the local portion of the globally shared memory.  The entry has five 9-bit pointers to identify remote nodes caching the block (2^9 = 512 = max nodes in an Alewife machine) and one additional bit to indicate whether the local node is caching the block.  The remaining bits in the entry record the state of the block, including whether there are additional pointers that have been allocated through software.  When invalidating the pointers in hardware, the protocol must also check whether there are software pointers which must be invalidated as well.
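To make the bit budget concrete, the sketch below packs five 9-bit pointers, the local bit, and the remaining state bits into a 64-bit word.  The field order (pointers in the low bits, then the local bit, then state) is purely an assumption for illustration, as are the function names; the actual Alewife layout is not given here.  A real entry would also have to record which pointer slots are in use, presumably in the state bits.

```python
POINTER_BITS = 9                              # 2**9 = 512 addressable nodes
NUM_POINTERS = 5
LOCAL_BIT_POS = POINTER_BITS * NUM_POINTERS   # bit 45 (assumed position)
STATE_POS = LOCAL_BIT_POS + 1                 # bit 46 (assumed position)
STATE_BITS = 64 - STATE_POS                   # 18 bits left over for state


def pack_entry(pointers, local, state):
    """Pack pointers, local bit and state into one 64-bit integer."""
    if len(pointers) != NUM_POINTERS:
        raise ValueError("expected exactly five pointer slots")
    word = 0
    for slot, node in enumerate(pointers):
        if not 0 <= node < (1 << POINTER_BITS):
            raise ValueError("node id does not fit in 9 bits")
        word |= node << (slot * POINTER_BITS)
    word |= int(bool(local)) << LOCAL_BIT_POS
    if not 0 <= state < (1 << STATE_BITS):
        raise ValueError("state does not fit in the remaining bits")
    word |= state << STATE_POS
    return word


def unpack_entry(word):
    """Inverse of pack_entry."""
    mask = (1 << POINTER_BITS) - 1
    pointers = [(word >> (slot * POINTER_BITS)) & mask
                for slot in range(NUM_POINTERS)]
    local = bool((word >> LOCAL_BIT_POS) & 1)
    state = word >> STATE_POS
    return pointers, local, state


word = pack_entry([1, 2, 3, 511, 0], local=True, state=5)
print(hex(word), unpack_entry(word))
```

The arithmetic shows that 5 x 9 + 1 = 46 bits are used for pointers and the local bit, leaving 18 bits for block state.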

To make this scheme worthwhile, it is assumed that normal usage will not require the software extensions to keep track of cached blocks.  The fast trap design of the Sparcle processor ensures that even in cases where software help is needed, the delay is not prohibitive.  The cost is still substantial, however: a request that causes five invalidations can be handled in hardware in only 84 cycles, whereas a request requiring six invalidations, which must be handled in software, takes 707 cycles.

The hardware mechanisms that are required to implement the software-extended protocols are as follows:

  • A fast interrupt mechanism:  A processor must be able to interrupt application code and switch to the software extension rapidly.
  • Processor to network interface:  In order to emulate the protocol functions normally performed by the hardware directory, the processor must be able to send messages to and receive messages from the interconnection network.
  • Extra directory state:  Each directory entry must hold the extra state necessary to indicate whether the processor is holding overflow pointers.



Memory and Cache States

As seen in the table below, memory blocks can be in one of four states, and each cached copy of a block can be in one of three.  The state of a memory block determines the states that its copies may have in the caches that currently hold it.
 
Component  State              Meaning
Memory     Read-Only          Caches have read-only copies of the data
           Read-Write         One cache has a read-write copy of the data
           Read-Transaction   Holding read request, update is in progress
           Write-Transaction  Holding write request, invalidation is in progress
Cache      Invalid            Cache block may not be read or written
           Read-Only          Cache block may be read, but not written
           Read-Write         Cache block may be read or written

Before any processor modifies a block in an Invalid or Read-Only cache state, it first requests permission from the CMMU (Communications and Memory Management Unit) that manages the data.  The CMMU then sends invalidations to each of the cached copies and waits for each of the caches to invalidate its copy (change cache state to Invalid) and return an acknowledgment message.  The memory block is in Write-Transaction state while waiting for acknowledgments.  When all the acks are received, the state of the memory block becomes Read-Write and the cache that originated the transaction is sent a message that it has write permission (it is now in Read-Write state).  In a sense, the cache now "owns" the block until another cache requests access to the data.
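The write sequence above can be sketched from the directory's point of view as follows.  This is a simplified model with invented names (BlockDirectory, write_request, acknowledge); it uses a plain set for the pointers and a counter for outstanding acknowledgments, and it ignores message transport entirely.

```python
class BlockDirectory:
    """Memory-side state for one block during a write transaction."""

    def __init__(self, sharers):
        self.state = "Read-Only"
        self.pointers = set(sharers)  # caches holding read-only copies
        self.ack_ctr = 0              # acknowledgments still outstanding
        self.pending_writer = None

    def write_request(self, requester):
        """Start a write; return the caches that must be sent invalidations."""
        targets = sorted(self.pointers - {requester})
        self.pending_writer = requester
        if targets:
            self.state = "Write-Transaction"
            self.ack_ctr = len(targets)
        else:
            self._grant()  # nobody else holds a copy
        return targets

    def acknowledge(self, cache_id):
        """Handle one ack; return the writer's id once permission is granted."""
        self.pointers.discard(cache_id)
        self.ack_ctr -= 1
        if self.ack_ctr == 0:
            return self._grant()
        return None

    def _grant(self):
        self.state = "Read-Write"
        writer, self.pending_writer = self.pending_writer, None
        self.pointers = {writer}  # the writer now "owns" the block
        return writer


directory = BlockDirectory(sharers={1, 2, 3})
print(directory.write_request(1), directory.state)
print(directory.acknowledge(2), directory.state)
print(directory.acknowledge(3), directory.state)
```

The block stays in Write-Transaction until the counter reaches zero, so the writer never receives permission while a stale read-only copy still exists.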

When one cache has a block in Read-Write state and another cache requests read privileges for that block, the CMMU sends an update request to the cache that owns the data.  The block that is waiting for data is marked with the Read-Transaction state.  When a cache receives an update request, it invalidates its copy of the data and replies to memory with an update message that contains the modified data so that the original read request can be satisfied.  After the transaction is complete, the original owner's cached block is now Invalid and the new reader's cached block is Read-Only.
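As a small walk-through of the read case, the function below applies the steps in order and records the states the memory block passes through.  The function name and the dictionaries used to hold cache state and data are hypothetical, and the final memory state of Read-Only is inferred from the state table rather than stated explicitly in the text.

```python
def read_owned_block(cache_state, cache_data, owner, reader):
    """Serve a read of a block that `owner` currently holds Read-Write.

    Returns the sequence of memory states for the block.
    """
    if cache_state.get(owner) != "Read-Write":
        raise ValueError("owner does not hold the block Read-Write")
    trace = ["Read-Write"]

    # The CMMU sends an update request to the owner and waits for the data.
    trace.append("Read-Transaction")

    # The owner invalidates its copy and replies with the modified data.
    data = cache_data.pop(owner)
    cache_state[owner] = "Invalid"

    # Memory satisfies the original read request with the returned data.
    cache_data[reader] = data
    cache_state[reader] = "Read-Only"
    trace.append("Read-Only")  # inferred final memory state
    return trace


states = {0: "Read-Write", 1: "Invalid"}
data = {0: 42}
print(read_owned_block(states, data, owner=0, reader=1))
print(states, data)
```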


The Cache Coherence Protocol

The following diagrams describe the LimitLESS cache coherence protocol.  The state transition diagram specifies the states, the composition of the pointer set (P), and the transitions between the states.  Each transition is labeled with a number that refers to the specification in the table below the diagram.  The table contains the following information:
  1. The input message from a cache that initiates the transaction and the identifier of the cache that sends it
  2. A precondition (if any) for executing the transition
  3. Any directory entry change that the transition may require
  4. The output message or messages that are sent in response to the input message
Note that certain transitions require the use of an acknowledgment counter (AckCtr), which is used to ensure that cached copies are invalidated before allowing a write transaction to be completed.

As an example, Transition 2 from the Read-Only to the Read-Write state is taken when cache i requests write permission (WREQ) and the pointer set is empty or contains just cache i (P = {} or P = {i}).  In this case, the pointer set is modified to contain i (if necessary) and the CMMU issues a message containing the data of the block to be written (WDATA).
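Transition 2 can be written out as a single function that checks the precondition, updates the pointer set, and produces the output message.  The function name and the tuple used to represent the WDATA message are assumptions made for this sketch.

```python
def transition_2(pointer_set, requester):
    """Read-Only -> Read-Write on a WREQ from cache `requester`.

    Returns (new_state, new_pointer_set, output_message), or None when the
    precondition P = {} or P = {requester} does not hold.
    """
    if pointer_set not in (set(), {requester}):
        return None  # other caches hold copies; a different transition applies
    new_pointer_set = {requester}
    output_message = ("WDATA", requester)  # data of the block, sent to cache i
    return "Read-Write", new_pointer_set, output_message


print(transition_2(set(), 4))
print(transition_2({4}, 4))
print(transition_2({3, 4}, 4))
```

When other caches also hold copies, the precondition fails and the write must instead go through the invalidation path that uses AckCtr.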

The extra notation on the Read-Only state (S: n > p) indicates that the state is handled in software when the size of the pointer set (n) exceeds the size of the limited directory entry (p).  In this situation, the transitions with the shaded labels (1, 2, and 3) are executed by the interrupt handler on the processor that is local to the overflowing directory.  When the protocol changes from a software-handled state to a hardware-handled state, the processor must modify the directory state so that the CMMU can resume responsibility for the protocol transitions.
