The Alewife Architecture

Alewife is a scalable, shared-memory multiprocessor that supports up to 512 nodes. The nodes are connected by a mesh interconnection network, which allows point-to-point messaging between any two nodes and so avoids much of the congestion inherent in bus-based architectures. Each node contains:
  • A Sparcle processor
  • A floating point coprocessor
  • 4 MB of global shared memory
  • A directory for the shared memory in that node
  • 64 KB of direct-mapped cache
  • A Communications and Memory Management Unit (CMMU)
  • An Elko-series mesh routing chip (EMRC) from Caltech
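
As a rough sanity check on the numbers above, the following Python sketch totals the shared memory and cache of a fully populated machine and counts the hops between two nodes. The two-dimensional, 32-wide layout, the row-major node numbering, and the names NodeConfig, aggregate, and mesh_hops are illustrative assumptions, not details taken from the Alewife design.

```python
from dataclasses import dataclass

KB = 1024
MB = 1024 * KB


@dataclass(frozen=True)
class NodeConfig:
    """Per-node capacities as listed above."""
    shared_memory_bytes: int = 4 * MB
    cache_bytes: int = 64 * KB


def aggregate(nodes: int, cfg: NodeConfig = NodeConfig()) -> dict:
    """Total shared memory and cache across `nodes` identical nodes."""
    return {
        "shared_memory_bytes": nodes * cfg.shared_memory_bytes,
        "cache_bytes": nodes * cfg.cache_bytes,
    }


def mesh_hops(src: int, dst: int, width: int) -> int:
    """Hop count between two nodes in a hypothetical 2D mesh.

    Assumes row-major node numbering and a shortest path that moves
    only along the two mesh axes (Manhattan distance).
    """
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)


totals = aggregate(512)
print(totals["shared_memory_bytes"] // MB, "MB of shared memory")
print(totals["cache_bytes"] // MB, "MB of cache")
print(mesh_hops(0, 511, width=32), "hops corner to corner")
```

Under these assumptions, a full 512-node machine holds 2 GB of shared memory and 32 MB of cache in total.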

What the heck is a Sparcle processor?

Sparcle was derived from an industry-standard SPARC processor. In addition to the standard SPARC functionality, it has extra instructions that allow it to switch between processes in 11 cycles. This is accomplished in part by maintaining separate register windows for each active context, so a switch does not have to save and restore registers through memory. Both the cache and the floating point coprocessor are off-the-shelf, SPARC-compatible components.
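
The idea of keeping one register frame per context can be sketched in a few lines of Python. This is a toy model only: the class name MultiContextCPU, the choice of four contexts, and the eight registers per frame are assumptions made for illustration. Only the 11-cycle switch cost comes from the text above.

```python
class MultiContextCPU:
    """Toy model of a processor with one register frame per context."""

    SWITCH_CYCLES = 11  # switch cost quoted in the text

    def __init__(self, num_contexts: int = 4, regs_per_context: int = 8):
        # Each context owns its registers, so nothing is spilled on a switch.
        # The counts here are hypothetical, chosen only for the example.
        self.frames = [[0] * regs_per_context for _ in range(num_contexts)]
        self.current = 0
        self.cycles = 0

    def write(self, reg: int, value: int) -> None:
        self.frames[self.current][reg] = value

    def read(self, reg: int) -> int:
        return self.frames[self.current][reg]

    def switch_to(self, ctx: int) -> None:
        """Switch contexts by changing which frame is active."""
        if not 0 <= ctx < len(self.frames):
            raise ValueError(f"no such context: {ctx}")
        self.current = ctx
        self.cycles += self.SWITCH_CYCLES


cpu = MultiContextCPU()
cpu.write(0, 42)      # context 0 stores a value
cpu.switch_to(1)      # move to context 1; context 0's frame is untouched
cpu.write(0, 7)
cpu.switch_to(0)      # back to context 0
print(cpu.read(0), cpu.cycles)
```

The point of the model is that a switch only changes which frame is active. The registers of the suspended context stay where they are, which is what keeps the cost to a handful of cycles.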
 

Communications and Memory Management Unit


The Communications and Memory Management Unit, or CMMU, implements most of the unique functionality of Alewife. In an Alewife node, the CMMU is connected directly to the first-level cache bus and serves much the same function as the cache controller and memory-management unit in a uniprocessor. The CMMU fields memory requests from the local Sparcle processor and, when necessary, synthesizes the messages that fetch memory from remote nodes. Communications functionality of the CMMU includes:

  • Support for distributed, cache-coherent shared memory via the LimitLESS cache-coherence protocol:  the CMMU supports up to five hardware pointers per memory line for normal data sharing and can invoke software interrupt handlers to employ additional pointers.  Clean data can be fetched from a neighboring node in 30 cycles.
  • Support for fast user-level messaging with integrated DMA.  A simple message, consisting of a header and one data word, can be launched in seven cycles.
  • Several mechanisms for latency tolerance, including non-binding software prefetch and rapid context switching.  A remote cache miss is signaled immediately to the Sparcle processor, which can switch to a new context in 11 cycles.
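
The limited-pointer idea behind LimitLESS, described in the first item above, can be sketched as follows. This is a simplified model, not the actual protocol: the class LimitLessEntry and its methods are hypothetical names, and the software handler is reduced to a Python set plus a trap counter. Only the limit of five hardware pointers per memory line comes from the text.

```python
class LimitLessEntry:
    """Toy directory entry for one memory line."""

    HW_POINTERS = 5  # hardware pointers per memory line, per the text

    def __init__(self):
        self.hw_sharers = []      # node IDs tracked in hardware
        self.sw_sharers = set()   # overflow node IDs tracked by software
        self.software_traps = 0   # how often the software path was taken

    def add_sharer(self, node: int) -> None:
        """Record that `node` holds a copy of this line."""
        if node in self.hw_sharers or node in self.sw_sharers:
            return  # already tracked
        if len(self.hw_sharers) < self.HW_POINTERS:
            self.hw_sharers.append(node)   # common case: hardware only
        else:
            # Hardware pointers are exhausted. In this sketch, that stands
            # in for invoking a software interrupt handler.
            self.software_traps += 1
            self.sw_sharers.add(node)

    def sharers(self) -> set:
        return set(self.hw_sharers) | self.sw_sharers

    def invalidate_all(self) -> set:
        """Return every node that must be invalidated, then clear the entry."""
        targets = self.sharers()
        self.hw_sharers.clear()
        self.sw_sharers.clear()
        return targets


entry = LimitLessEntry()
for node in range(7):
    entry.add_sharer(node)
print(len(entry.hw_sharers), len(entry.sw_sharers), entry.software_traps)
```

In this model, lines shared by five or fewer nodes never leave the hardware path, and only more widely shared lines pay the cost of the software handler.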