The MIT Alewife Project

Goals

In recent years, large-scale multiprocessors have been developed that are capable of truly astounding feats of computing power.  While the best case always attracts the most attention, it is important to realize that such results do not come from machine design effort alone.  Each problem also generally requires agonizing months of algorithm development, programming, debugging, and relentless tuning.  Worse yet, a parallel architecture is often designed around a specific problem and is then effective only for that type of problem.  That is fine for weather forecasters, but what about the rest of us?

The MIT Alewife machine was designed with programmability in mind.  The hardware, compiler, and operating system work together to solve the problems that traditionally burden parallel programmers, namely scheduling computation and moving data between processing elements.  Features of the Alewife system include:

  • A globally shared address space
  • A scalable cache coherence mechanism
  • A compiler that automatically partitions regular programs with loops
  • A library of efficient synchronization and communication routines
  • Distributed garbage collection
  • A parallel debugger
Another goal of the project is scalability.  The architecture aids scalability because Alewife machines are built by combining individual, modular processing nodes.  The nodes are connected by a simple, low-cost two-dimensional mesh, and VME and SCSI interface boards plug into the edges of the mesh to provide I/O facilities.  Whether an Alewife machine consists of two or 512 nodes, the cost per node remains constant.  This cost was about $2000 per node in the prototype, but could be reduced greatly with volume production.
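
The constant per-node cost follows from the mesh topology: each node connects only to its immediate neighbors, so the number of links per node does not grow with machine size. The sketch below illustrates this under stated assumptions. The row-major node numbering, the absence of wraparound links, and the function names are all hypothetical and chosen for illustration; they are not taken from the Alewife design documents.

```python
# Illustrative sketch only: a width x height mesh with nodes numbered
# row by row and no wraparound links (both are assumptions).

def node_coords(node_id, width):
    """Return the (x, y) position of a node in a row-major numbered mesh."""
    return node_id % width, node_id // width

def neighbors(node_id, width, height):
    """Return the ids of the nodes directly linked to node_id."""
    x, y = node_coords(node_id, width)
    result = []
    for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            result.append(ny * width + nx)
    return result

def hop_count(src, dst, width):
    """Minimum number of mesh links a message crosses from src to dst."""
    sx, sy = node_coords(src, width)
    tx, ty = node_coords(dst, width)
    return abs(sx - tx) + abs(sy - ty)

if __name__ == "__main__":
    # A 512-node machine arranged (hypothetically) as 32 x 16.
    most_links = max(len(neighbors(n, 32, 16)) for n in range(512))
    print("links per node never exceed", most_links)
    print("corner-to-corner hops:", hop_count(0, 511, 32))
```

No node ever has more than four links, whatever the machine size; what grows with the mesh is the distance between far-apart nodes, not the hardware attached to each node.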

The Programming Model

While the programmer sees a shared memory programming model, the actual implementation uses message passing to share data.  Message passing keeps the architecture efficient and scalable as the number of processing nodes in the system becomes large.  Alewife features that help to improve the performance of message passing include:
  • Both system and user code can quickly describe and atomically launch a packet directly into the interconnection network
  • A fast interrupt mechanism speeds message reception
  • A direct memory access (DMA) mechanism allows data to flow between the network and memory
Programmers are also given mechanisms for using message passing directly.  Because Alewife supports the shared memory abstraction efficiently, the user can choose whichever model suits the need.
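
To make the relationship between the two models concrete, here is a toy sketch of a global address space layered on message passing. It is a simplification under stated assumptions, not Alewife's actual protocol: it has no caches or coherence, every remote access is assumed to cost one request and one reply message, and the `Machine` class, its methods, and the address-to-node mapping are hypothetical.

```python
# Toy model: a flat global address space whose words live on different
# nodes. All names and the block address mapping are assumptions.

class Machine:
    def __init__(self, num_nodes, words_per_node):
        self.words_per_node = words_per_node
        # Each node owns one contiguous slice of the global address space.
        self.memories = [[0] * words_per_node for _ in range(num_nodes)]
        self.messages_sent = 0

    def home_of(self, address):
        """Map a global address to (home node, offset within that node)."""
        return divmod(address, self.words_per_node)

    def load(self, requester, address):
        home, offset = self.home_of(address)
        if home != requester:
            self.messages_sent += 2  # read request + data reply
        return self.memories[home][offset]

    def store(self, requester, address, value):
        home, offset = self.home_of(address)
        if home != requester:
            self.messages_sent += 2  # write request + acknowledgement
        self.memories[home][offset] = value

if __name__ == "__main__":
    m = Machine(num_nodes=4, words_per_node=100)
    m.store(0, 250, 7)      # node 0 writes a word that lives on node 2
    print(m.load(0, 250))   # remote read: same call as a local one
    print(m.load(2, 250))   # local read on the home node: no messages
    print(m.messages_sent)
```

The caller uses the same `load` and `store` calls whether the data is local or remote; only the message count reveals the difference.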

Alewife has compilers for a parallel version of ANSI C and a parallel version of LISP called Mul-T.  For parallel C, Alewife supports the p4 library from Argonne National Laboratory as well as parallel loops and distributed arrays.

Latency Tolerance

Because there is no way to avoid all cache misses, Alewife provides mechanisms for minimizing the delay caused by fetching data from a remote node.  The Alewife compiler supports prefetching, so that latency can be hidden by requesting data before it is actually needed.  Block multithreading allows the processor to switch to a different thread of execution when the current thread is delayed by a cache miss.  This is made practical by the fast context switching of the Sparcle processor.
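
The effect of block multithreading can be seen in a small cycle-counting simulation. This is a minimal sketch under stated assumptions, not a model of Sparcle: the `run` function, the one-cycle cost of a compute operation, and the latency and switch-cost values are all hypothetical.

```python
# Toy simulation of one processor with block multithreading.
# Op codes, costs, and the scheduling rule are illustrative assumptions.

def run(threads, miss_latency, switch_cost):
    """Return the cycle count needed to finish every thread.

    Each thread is a string of ops: 'c' is one cycle of compute and
    'm' is a remote cache miss that takes miss_latency cycles to fill.
    On a miss the processor switches to another ready thread, paying
    switch_cost cycles; if no thread is ready, it idles.
    """
    pc = [0] * len(threads)        # next op index for each thread
    ready_at = [0] * len(threads)  # cycle at which each thread can resume
    now = 0
    current = 0
    while True:
        unfinished = [i for i, t in enumerate(threads) if pc[i] < len(t)]
        if not unfinished:
            break
        runnable = [i for i in unfinished if ready_at[i] <= now]
        if not runnable:
            # Every remaining thread is waiting on memory: idle.
            now = min(ready_at[i] for i in unfinished)
            continue
        if current not in runnable:
            current = runnable[0]
            now += switch_cost
        op = threads[current][pc[current]]
        pc[current] += 1
        if op == "c":
            now += 1
        else:
            ready_at[current] = now + miss_latency
    # A miss issued as a thread's last op must still complete.
    return max([now] + ready_at)

if __name__ == "__main__":
    one_after_another = run(["cmccmc"], miss_latency=10, switch_cost=1)
    interleaved = run(["cmc", "cmc"], miss_latency=10, switch_cost=1)
    print(one_after_another, interleaved)
```

With these assumed numbers, the same work finishes sooner when split across two threads, because part of each miss overlaps with the other thread's computation. The benefit shrinks as the switch cost approaches the miss latency, which is why a fast context switch matters.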

Debugging and Tuning

A version of the GNU Debugger (GDB) has been developed for Alewife to support program debugging.  The debugger allows the user to set breakpoints, examine data and registers on individual nodes, and inspect both active and blocked threads.  For parallel programs it is also useful to inspect execution in order to tune for maximum performance.  Alewife's LimitLESS cache coherence system can be configured to collect information about which memory locations are being shared and accessed and how that affects performance.  The Communications and Memory Management Unit (CMMU), which handles memory accesses for its node, also provides extensive facilities for performance monitoring.  It can generate histograms of a variety of hardware events, including cache hits and misses, instruction counts, and network throughput statistics.  A graphical user interface gives the user both static and dynamic views of this performance data.
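
As a rough illustration of what an event histogram offers for tuning, the sketch below buckets a trace of events by time and derives a miss rate. The trace format, the event names, and both functions are hypothetical; this does not describe the hardware's actual counters or interface.

```python
# Illustrative only: build per-event histograms from a software trace.
# The (cycle, kind) record format and the event names are assumptions.
from collections import Counter

def event_histogram(events, bucket_cycles):
    """Count events of each kind per time bucket.

    events is an iterable of (cycle, kind) pairs. The result maps each
    kind to a Counter keyed by bucket index (cycle // bucket_cycles).
    """
    hist = {}
    for cycle, kind in events:
        hist.setdefault(kind, Counter())[cycle // bucket_cycles] += 1
    return hist

def miss_rate(hist):
    """Fraction of cache accesses that missed, over the whole trace."""
    hits = sum(hist.get("cache_hit", Counter()).values())
    misses = sum(hist.get("cache_miss", Counter()).values())
    total = hits + misses
    return misses / total if total else 0.0

if __name__ == "__main__":
    trace = [(3, "cache_hit"), (8, "cache_miss"), (12, "cache_hit"),
             (15, "cache_hit"), (27, "cache_miss")]
    hist = event_histogram(trace, bucket_cycles=10)
    for kind, buckets in sorted(hist.items()):
        print(kind, dict(sorted(buckets.items())))
    print("miss rate:", miss_rate(hist))
```

A per-bucket view of this kind shows when misses cluster during a run, which a single aggregate count would hide.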
