The Alewife Architecture
Alewife is a scalable, shared-memory multiprocessor which supports up to
512 nodes. The individual nodes are connected by a mesh interconnection
network, which allows point-to-point messaging between any two nodes in
the network, thus eliminating much of the network congestion inherent in
bus architectures. Each individual node contains:
- A Sparcle processor
- A floating point coprocessor
- 4 MB of global shared memory
- A directory for the shared memory in that node
- 64 KB of direct-mapped cache
- A Communications and Memory Management Unit (CMMU)
- An Elko-series mesh routing chip (EMRC) from Caltech
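The per-node figures above imply the aggregate capacity of a full machine.
The short Python sketch below only does that arithmetic; the class and
function names (AlewifeNode, machine_totals) are hypothetical and chosen
for illustration, not taken from any Alewife software.

```python
from dataclasses import dataclass

KB = 1024
MB = 1024 * KB
MAX_NODES = 512  # upper limit quoted in the text


@dataclass(frozen=True)
class AlewifeNode:
    """Per-node resources as listed above (hypothetical model class)."""
    shared_memory_bytes: int = 4 * MB
    cache_bytes: int = 64 * KB


def machine_totals(num_nodes: int, node: AlewifeNode = AlewifeNode()) -> dict:
    """Return aggregate shared memory and cache for a machine of num_nodes."""
    if not 1 <= num_nodes <= MAX_NODES:
        raise ValueError(f"node count must be between 1 and {MAX_NODES}")
    return {
        "shared_memory_bytes": num_nodes * node.shared_memory_bytes,
        "cache_bytes": num_nodes * node.cache_bytes,
    }


totals = machine_totals(512)
print(totals["shared_memory_bytes"] // MB, "MB of global shared memory")
print(totals["cache_bytes"] // MB, "MB of cache in total")
```

For a full 512-node configuration this works out to 2048 MB (2 GB) of
global shared memory and 32 MB of cache across the machine.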
What the heck is a Sparcle processor?
Sparcle was derived from an industry standard SPARC processor. In
addition to the standard SPARC functionality, it has additional instructions
which allow it to switch between processes in 11 cycles. This is
partly accomplished by maintaining separate register windows for each active
context. Both the cache and the floating point coprocessor are
off-the-shelf, SPARC-compatible components.
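To illustrate why separate register windows make switching cheap, here is a
toy Python model. It is a sketch under stated assumptions, not a
description of the actual Sparcle hardware: the number of contexts, the
window size, and the names ToyProcessor and Context are all hypothetical.
Only the 11-cycle switch cost comes from the text.

```python
from dataclasses import dataclass, field

SWITCH_CYCLES = 11   # switch cost quoted in the text
WINDOW_SIZE = 8      # assumption: toy window size, not the real one


@dataclass
class Context:
    """One resident context with its own private register window."""
    name: str
    registers: list = field(default_factory=lambda: [0] * WINDOW_SIZE)


class ToyProcessor:
    def __init__(self, context_names):
        # Every context keeps its registers resident, so nothing has to be
        # saved to or restored from memory when we switch.
        self.contexts = [Context(n) for n in context_names]
        self.current = 0
        self.cycles = 0

    @property
    def active(self) -> Context:
        return self.contexts[self.current]

    def switch(self) -> Context:
        """Move to the next context; only the window index changes."""
        self.current = (self.current + 1) % len(self.contexts)
        self.cycles += SWITCH_CYCLES
        return self.active


cpu = ToyProcessor(["A", "B", "C"])
cpu.active.registers[0] = 42      # context A writes a register
cpu.switch()                      # now running B
cpu.active.registers[0] = 7       # B's write does not disturb A
cpu.switch()
cpu.switch()                      # back to A after a full round
print(cpu.active.name, cpu.active.registers[0], cpu.cycles)
```

Because each context owns its window, context A finds its register value
intact after the round trip, and the only cost charged in this model is
the fixed per-switch cycle count.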
Communications and Memory Management Unit
The Communications and Memory Management Unit, or CMMU, implements
most of the unique functionality of Alewife. In an Alewife node,
the CMMU is connected directly to the first-level cache bus and serves
much the same function as a cache controller/memory-management unit
in a uniprocessor. The CMMU fields memory requests from the local
Sparcle processor and, when necessary, it synthesizes the messages that
fetch memory from remote nodes. Communications functionality of the
CMMU includes:
- Support for distributed, cache-coherent shared memory via the LimitLESS
  cache-coherence protocol: the CMMU supports up to five hardware pointers
  per memory line for normal data sharing and can invoke software interrupt
  handlers to employ additional pointers. Clean data can be fetched
  from a neighboring node in 30 cycles.
- Support for fast user-level messaging with integrated DMA. A simple
  message, consisting of a header and one data word, can be launched in
  seven cycles.
- Several mechanisms for latency tolerance, including non-binding software
  prefetch and rapid context switching. A remote cache miss is signaled
  immediately to the Sparcle processor, which can switch to a new context
  in 11 cycles.
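The pointer-overflow idea behind the LimitLESS item above can be sketched
as a toy directory entry in Python. This is a simplified model under
assumptions: the text only says that five pointers are kept in hardware and
that software handlers supply more, so the way the overflow set is stored
here, and the names DirectoryEntry, add_sharer, and invalidate_all, are
hypothetical.

```python
HW_POINTERS = 5  # hardware pointers per memory line, as stated in the text


class DirectoryEntry:
    """Toy directory entry for one memory line (hypothetical model)."""

    def __init__(self):
        self.hw_pointers = []      # node ids tracked in hardware
        self.sw_pointers = set()   # assumption: overflow kept by software
        self.software_traps = 0    # how often the software path was taken

    def add_sharer(self, node_id: int) -> None:
        """Record that node_id holds a cached copy of this line."""
        if node_id in self.hw_pointers or node_id in self.sw_pointers:
            return  # already tracked
        if len(self.hw_pointers) < HW_POINTERS:
            self.hw_pointers.append(node_id)   # common case: hardware only
        else:
            self.software_traps += 1           # overflow: invoke the handler
            self.sw_pointers.add(node_id)

    def sharers(self) -> set:
        return set(self.hw_pointers) | self.sw_pointers

    def invalidate_all(self) -> set:
        """Return every node that must be invalidated, then clear the entry."""
        targets = self.sharers()
        self.hw_pointers.clear()
        self.sw_pointers.clear()
        return targets


entry = DirectoryEntry()
for node in range(7):              # seven readers share one line
    entry.add_sharer(node)
print(len(entry.hw_pointers), len(entry.sw_pointers), entry.software_traps)
```

In this model the first five sharers are handled without any software
involvement, and only the sixth and seventh take the slower software path,
which matches the intent of keeping the common, lightly shared case in
hardware.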