Note: There is a really good tutorial on memory consistency models at ftp://gatekeeper.dec.com/pub/DEC/WRL/research-reports/WRL-TR-95.7.pdf. A great deal of the information in these notes comes from that paper.

# Memory Consistency Models

These notes describe some of the important memory consistency models which have been considered in recent years. The basic point is going to be that trying to implement our intuitive notion of what it means for memory to be consistent is really hard and terribly expensive, and isn't necessary to get a properly written parallel program to run correctly. So we're going to produce a series of weaker definitions that will be easier to implement, but will still allow us to write a parallel program that runs predictably.

## Notation

In describing the behavior of these memory models, we are only interested in the shared memory behavior - not anything else related to the programs. We aren't interested in control flow within the programs, data manipulations within the programs, or behavior related to local (in the sense of non-shared) variables. There is a standard notation for this, which we'll be using in what follows.

In the notation, there will be a line for each processor in the system, and time proceeds from left to right. Each shared-memory operation performed will appear on the processor's line. The two main operations are Read and Write, which are expressed as

W(var)value

which means "write value to shared variable var", and

R(var)value

which means "read shared variable var, obtaining value."

So, for instance, W(x)1 means "write a 1 to x" and R(y)3 means "read y, and get the value 3."

More operations (especially synchronization operations) will be defined as we go on. For simplicity, variables are assumed to be initialized to 0.

An important thing to notice about this is that a single high-level language statement (like x = x + 1;) will typically appear as several memory operations. If x previously had a value of 0, then that statement becomes (in the absence of any other processors)

P1:  R(x)0 W(x)1
-----------------

On a RISC-style processor, it's likely that the C statement would have turned into three instructions: a load, an add, and a store. Of those three instructions, two affect memory and are shown in the diagram.

On a CISC-style processor, the statement would probably have turned into a single, in-memory add instruction. Even so, the processor would have executed the instruction by reading memory, doing the addition, and then writing memory, so it would still appear as two memory operations.

Notice that the actual memory operations performed could equally well have been performed by some completely different high level language code; maybe an if-then-else statement that checked and then set a flag. If I ask for memory operations and there is anything in your answer that looks like a transformation of the data, then something is wrong!

## Strict Consistency

The intuitive notion of memory consistency is the strict consistency model. In the strict model, any read to a memory location X returns the value stored by the most recent write operation to X. If we have a bunch of processors, with no caches, talking to memory through a bus then we will have strict consistency. The point here is the precise serialization of all memory accesses.

We can give an example of what is, and what is not, strict consistency and also show an example of the notation for operations in the memory system. As we said before, we assume that all variables have a value of 0 before we begin. An example of a scenario that would be valid under the strict consistency model is the following:

P1:  W(x)1
-----------------------
P2:        R(x)1 R(x)1

This says, "processor P1 writes a value of 1 to variable x; at some later time processor P2 reads x and obtains a value of 1. Then it reads it again and gets the same value."

Here's another scenario which would be valid under strict consistency:

P1:        W(x)1
-------------------------------
P2:  R(x)0       R(x)1

This time, P2 got a little ahead of P1; its first read of x got a value of 0, while its second read got the 1 that was written by P1. Notice that these two scenarios could be obtained in two runs of the same program on the same processors.

Here's a scenario which would not be valid under strict consistency:

P1:  W(x)1
-----------------------
P2:        R(x)0 R(x)1

In this scenario, the new value of x had not been propagated to P2 yet when it did its first read, but it did reach it eventually.

I've also seen this model called atomic consistency.

## Sequential Consistency

Sequential consistency is a slightly weaker model than strict consistency. It was defined by Lamport as follows: the result of any execution is the same as if the reads and writes of all the processors occurred in some single sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In essence, any ordering that could have been produced by a strict ordering regardless of processor speeds is valid under sequential consistency. The idea is that by expanding from the sets of reads and writes that actually happened to the sets that could have happened, we can reason more effectively about the program (since we can ask the far more useful question, "could the program have broken?"). We can reason about the program itself, with less interference from the details of the hardware on which it is running. It's probably fair to say that if we have a computer system that really uses strict consistency, we'll want to reason about it using sequential consistency.

The third scenario above would be valid under sequential consistency. Here's another scenario that would be valid under sequential consistency:

P1:  W(x)1
-----------------------
P2:        R(x)1 R(x)2
-----------------------
P3:        R(x)1 R(x)2
-----------------------
P4:  W(x)2

This one is valid under sequential consistency because the following alternate interleaving would have been valid under strict consistency:

P1:  W(x)1
-----------------------------
P2:        R(x)1       R(x)2
-----------------------------
P3:        R(x)1       R(x)2
-----------------------------
P4:              W(x)2

Here's a scenario that would not be valid under sequential consistency:

P1:  W(x)1
-----------------------
P2:        R(x)1 R(x)2
-----------------------
P3:        R(x)2 R(x)1
-----------------------
P4:  W(x)2

Oddly enough, the precise definition, as given by Lamport, doesn't even require that ordinary notions of causality be maintained; it's possible to see the result of a write before the write itself takes place, as in:

P1:        W(x)1
-----------------------
P2:  R(x)1

This is valid because there is a different ordering which, in strict consistency, would yield P2 reading x as having a value of 1. This isn't a flaw in the model; if your program can indeed violate causality like this, you're missing some synchronization operations in your program. Note that we haven't talked about synchronization operations yet; we will soon.
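The definition can be checked mechanically for small scenarios like the ones above. The following is a minimal sketch, not part of the original notes: it searches every interleaving that respects each processor's program order and accepts the scenario if some interleaving makes every read return the most recently written value. The names `Op`, `Proc`, `search`, and `sequentially_consistent` are hypothetical, variables are identified by small integer indices, and all variables are assumed to start at 0.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PROCS 8
#define MAX_VARS  4

typedef struct { char kind; int var; int value; } Op;   /* kind: 'R' or 'W' */
typedef struct { const Op *ops; int len; } Proc;        /* one line of a diagram */

/* Depth-first search over interleavings. next[p] is the index of the next
   unexecuted operation of processor p; mem[] is the current memory image. */
static bool search(const Proc *procs, int nprocs, int *next, int *mem) {
    bool all_done = true;
    for (int p = 0; p < nprocs; p++) {
        if (next[p] == procs[p].len) continue;       /* processor p is finished */
        all_done = false;
        Op op = procs[p].ops[next[p]];
        if (op.kind == 'R') {
            /* A read can be scheduled here only if it would return the
               value currently in memory. */
            if (mem[op.var] != op.value) continue;
            next[p]++;
            bool ok = search(procs, nprocs, next, mem);
            next[p]--;
            if (ok) return true;
        } else {
            int saved = mem[op.var];
            mem[op.var] = op.value;                  /* perform the write */
            next[p]++;
            bool ok = search(procs, nprocs, next, mem);
            next[p]--;
            mem[op.var] = saved;                     /* undo it and backtrack */
            if (ok) return true;
        }
    }
    return all_done;   /* true only when every operation has been placed */
}

/* True if some legal interleaving explains every read in the scenario. */
static bool sequentially_consistent(const Proc *procs, int nprocs) {
    assert(nprocs <= MAX_PROCS);
    int next[MAX_PROCS] = {0};
    int mem[MAX_VARS] = {0};   /* all variables are initialized to 0 */
    return search(procs, nprocs, next, mem);
}
```

To use it, each processor's line in a diagram becomes an array of `Op` values, for example `{'W', 0, 1}` for W(x)1 with x as variable 0. The search is exponential in the number of operations, so it is only meant for hand-sized scenarios.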

## Cache Coherence

Most authors treat cache coherence as being virtually synonymous with sequential consistency; it is perhaps surprising that it isn't. Sequential consistency requires a globally (i.e. across all memory locations) consistent view of memory operations, while cache coherence only requires a locally (i.e. per-location) consistent view. Here's an example of a scenario that would be valid under cache coherence but not sequential consistency:

P1:  W(x)1 W(y)2
-----------------------
P2:        R(x)0 R(x)2 R(x)1 R(y)0 R(y)1
-----------------------
P3:        R(y)0 R(y)1 R(x)0 R(x)1
-----------------------
P4:  W(x)2 W(y)1

P2 and P3 both saw P1's write to x as occurring after P4's (and in fact P3 never saw P4's write to x at all), and saw P4's write to y as occurring after P1's (this time, neither saw P1's write as occurring at all). But P2 saw P4's write to y as occurring after P1's write to x, while P3 saw P1's write to x occurring after P4's write to y.

This couldn't happen with a snoopy-cache based scheme. But it certainly could with a directory-based scheme.

## Do We Really Need Such a Strong Model?

Consider the following situation in a shared memory multiprocessor: processes running on two processors each change the value of a shared variable x, like this:

P1            P2
x = x + 1;    x = x + 2;

What happens? Without any additional information, there are four different orders in which the two processes can execute these statements, resulting in three different results:

1. P1 executes first: x will get a new value of 3.
2. P2 executes first: x will get a new value of 3.
3. P1 and P2 both read the data; P1 writes the modified version before P2 does: x will get a new value of 2.
4. P1 and P2 both read the data; P2 writes the modified version before P1 does: x will get a new value of 1.

We can characterize a program like this pretty easily and concisely: it's got a bug. With a bit more precision, we can say it has a data race: there is a variable modified by more than one process in a way such that the results depend on who gets there first. For this program to behave reliably, we have to have locks guaranteeing that one of the processes performs its entire operation before the other one starts.
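The four orders and their three results can be reproduced with a small, deterministic sketch (not from the original notes). It replays a schedule of individual memory operations instead of running real threads. The function name `run_schedule` and the one-letter encoding of steps are assumptions made for illustration: lower-case letters are P1's read and write, upper-case letters are P2's.

```c
#include <assert.h>

/* Replay one interleaving of the two statements and return the final x.
   'r' / 'w' : P1 reads x into its register / writes register + 1 to x
   'R' / 'W' : P2 reads x into its register / writes register + 2 to x */
static int run_schedule(const char *schedule) {
    int x = 0;                 /* shared variable, initially 0 */
    int reg1 = 0, reg2 = 0;    /* private registers of P1 and P2 */
    for (const char *s = schedule; *s != '\0'; s++) {
        switch (*s) {
        case 'r': reg1 = x;     break;   /* P1: R(x) */
        case 'w': x = reg1 + 1; break;   /* P1: W(x) */
        case 'R': reg2 = x;     break;   /* P2: R(x) */
        case 'W': x = reg2 + 2; break;   /* P2: W(x) */
        default:  assert(!"unknown step in schedule");
        }
    }
    return x;
}
```

With this encoding, `run_schedule("rwRW")` and `run_schedule("RWrw")` are the two orders in which one process finishes before the other starts, and `run_schedule("rRwW")` and `run_schedule("rRWw")` are the two orders in which both processes read before either writes.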

So... given that we have a data race, and the program's behavior is going to be unpredictable anyway, does it really matter if all the processors see all the changes in the same order? Attempting to achieve strict or sequential consistency might be regarded as trying to support the semantics of buggy programs -- since the result of the program is random anyway, why should we care whether it results in the right random value? But it gets worse, as we consider in the next sections...

## Optimizations and Consistency

Even if the program contains no bugs as written, compilers don't support sequential consistency in general. Compilers generally don't see the existence of other processors, let alone a consistency model. We can argue that this shows a need for languages with parallel semantics, but as long as programmers are going to use C and Java for parallel programs we're going to have to support them. Most languages support a semantics in which program order is maintained for each memory location, but not across memory locations; this gives compilers freedom to reorder code. So, for instance, if a program writes two variables x and y, and they do not depend on each other, the compiler is free to write these two values to memory in either order without affecting the correctness of the program. In a parallel environment, however, it is quite likely that a process running on some other processor does depend on the order in which x and y were written.

Two-process mutual exclusion gives a good example of this. Remember the code to enter a critical section is given by

flag[i] = true;
turn = 1-i;
while (flag[1-i] && (turn == (1-i))) ;

If the compiler decides (for whatever reason) to reverse the order of the writes to flag[i] and turn, this is perfectly correct code in a single-process environment but broken in a multiprocessing environment (and, of course, that's the situation that matters).

Worse, since processors support out-of-order execution, there's no guarantee that the program, as executed, will perform its memory accesses in the order specified by the machine code! And as processors and caches get ever more tightly coupled, and as machines use more and more aggressive instruction reordering, these sorts of optimizations can end up happening in hardware with little or no control. It's very easy to imagine a machine finishing the update to turn while it's still setting flag[i] up above, since accessing flag[i] involves access to an array.

This is a little bit of a red herring, since we can require that our compiler perform accesses of shared memory in the order specified by the program (in C, the volatile keyword specifies this for the compiler, although it does not by itself constrain reordering by the hardware). In the case of Intel processors, we can also force some ordering on memory accesses by using the lock prefix on instructions. But notice that what we are doing by adding these keywords and prefixes is establishing places in the code where we care about the precise ordering, and places where we do not. The following memory models expand on this idea.
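As one way of marking the places where ordering matters, here is a sketch of the mutual exclusion entry code using C11 atomics. This is an alternative to the volatile keyword and the lock prefix mentioned above, not something the notes themselves use, and it assumes a compiler that provides `<stdatomic.h>`. `atomic_store` and `atomic_load` default to sequentially consistent ordering, so neither the compiler nor the hardware may reorder the write to flag[i] with the write to turn. The function names `enter_critical` and `leave_critical` are hypothetical, and the exit step (clearing the flag) is the usual companion to the entry code, added here so the sketch is complete.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Objects with static storage duration are zero-initialized,
   so both flags start out false and turn starts out 0. */
static atomic_bool flag[2];
static atomic_int  turn;

/* Enter the critical section as process i (0 or 1). */
static void enter_critical(int i) {
    assert(i == 0 || i == 1);
    atomic_store(&flag[i], true);   /* sequentially consistent store ...      */
    atomic_store(&turn, 1 - i);     /* ... so this store cannot pass it       */
    while (atomic_load(&flag[1 - i]) && atomic_load(&turn) == 1 - i)
        ;                           /* spin while the other process has priority */
}

/* Leave the critical section as process i. */
static void leave_critical(int i) {
    assert(i == 0 || i == 1);
    atomic_store(&flag[i], false);
}
```

The ordinary accesses inside the critical section need no annotation; only the accesses that implement the lock are marked, which is exactly the distinction between ordered and unordered places described above.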

## Processor Consistency

This model is also called PRAM (an acronym for Pipelined Random Access Memory, not the Parallel Random Access Machine model from computability theory) consistency. It is defined as follows: writes done by a single processor are received by all other processors in the order in which they were issued, but writes from different processors may be seen in a different order by different processors. The basic idea of processor consistency is to better reflect the reality of networks in which the latency between different nodes can be different.

The scenario in the sequential consistency section that wasn't valid for sequential consistency (the one in which P2 reads 1 and then 2 while P3 reads 2 and then 1) would be valid for processor consistency. Here's how it could come about, in a machine in which the processors are connected by something more complex than a bus:

1. The processors are connected in a linear array, in the order P1, P2, P3, P4.
2. On the first cycle, P1 and P4 write their values and propagate them.
3. On the second cycle, the value from P1 has reached P2, and the value from P4 has reached P3. They read the values, seeing 1 and 2 respectively.
4. On the third cycle, the values have made it two hops. So now P2 sees 2 and P3 sees 1.

So you can see we meet the "hard" part of the definition (the part requiring writes from a single processor getting seen in-order) somewhat vacuously: P1 and P4 only make one write each, so P2 and P3 end up seeing P1's writes, and P4's writes, in order. But the point of the example is the counterintuitive part of the definition: they don't see the writes from P1 and from P4 as being in the same order.

Here's a scenario which would not be valid for processor consistency:

P1:  W(x)1 W(x)2
----------------------------------
P2:              R(x)2 R(x)1

P2 has seen the writes from P1 in an order different from the order in which they were issued.
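The in-order half of the definition is easy to check for a single writer and a single observer. The sketch below is an illustration, not part of the original notes. It assumes that the writer writes distinct nonzero values to one variable and that every read returns either the initial value 0 or one of those values. The names `issue_index` and `reads_in_issue_order` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Position of value in the writer's issue order, or -1 if the writer never
   wrote it (under the assumptions above, -1 means the initial value 0). */
static int issue_index(const int *writes, int nwrites, int value) {
    for (int i = 0; i < nwrites; i++)
        if (writes[i] == value) return i;
    return -1;
}

/* True if the observer's successive reads never step back to an older
   write, i.e. the writes were seen in the order in which they were issued. */
static bool reads_in_issue_order(const int *writes, int nwrites,
                                 const int *reads, int nreads) {
    assert(nwrites >= 0 && nreads >= 0);
    int last = -1;                       /* -1: still seeing the initial value */
    for (int i = 0; i < nreads; i++) {
        int idx = issue_index(writes, nwrites, reads[i]);
        if (idx < last) return false;    /* went back to an older write */
        last = idx;
    }
    return true;
}
```

For the scenario above, the writer's sequence is {1, 2} and the observer's reads are {2, 1}, so the check fails; reads of {1, 2} or {0, 2} would pass. The check says nothing about writes from different processors, which is the part of the definition that processor consistency deliberately leaves unconstrained.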

It turns out that the two-process mutual exclusion code above is broken under processor consistency.

One final note on processor consistency and PRAM consistency is that some authors make processor consistency slightly stronger than PRAM by requiring PC to be both PRAM consistent and cache coherent.

## Synchronization Accesses vs. Ordinary Accesses

A correctly written shared-memory parallel program will use mutual exclusion to guard access to shared variables. In the first buggy example above, we can guarantee deterministic behavior by adding a barrier to the code, which we'll denote as S for reasons that will become apparent later:

P1            P2
x = x + 1;
S;            S;
              x = x + 2;

In general, in a correct parallel program we obtain exclusive access to a set of shared variables, manipulate them any way we want, and then relinquish access, distributing the new values to the rest of the system. The other processors don't need to see any of the intermediate values; they only need to see the final values.

With this in mind, we can look at the different types of memory accesses more carefully. The classification of shared memory accesses below follows [Gharachorloo]; each later category subdivides the one before it. The various types of memory accesses are defined as follows:

Shared Access
Actually, we can have shared access to variables vs. private access. But the questions we're considering are only relevant for shared accesses, so that's all we're showing.
Competing vs. Non-Competing
If we have two accesses to the same location from different processors, and at least one is a write, they are competing accesses. They are considered competing because the result depends on which access occurs first (if there are two accesses, but they're both reads, it doesn't matter which is first).
Synchronizing vs. Non-Synchronizing
Ordinary competing accesses, such as variable accesses, are non-synchronizing accesses. Accesses used in synchronizing the processes are (of course) synchronizing accesses.
Acquire vs. Release
Finally, we can divide synchronization accesses into accesses to acquire locks, and accesses to release locks.

Remember that synchronization accesses should be much less common than other competing accesses (if you're spending all your time performing synchronization accesses there's something seriously wrong with your program!). So we can further weaken the memory models we use by treating sync accesses differently from other accesses.

## Weak Consistency

Weak consistency results if we divide competing accesses only into synchronizing and non-synchronizing accesses, and require the following properties:

1. Accesses to synchronization variables are sequentially consistent.
2. No access to a synchronization variable is allowed to be performed until all previous writes have completed everywhere.
3. No data access (read or write) is allowed to be performed until all previous accesses to synchronization variables have been performed.

Here's a valid scenario under weak consistency, which shows its real strength:

P1:  W(x)1 W(x)2             S
------------------------------------
P2:              R(x)0 R(x)2 S R(x)2
------------------------------------
P3:                    R(x)1 S R(x)2

In other words, there is no requirement that a processor broadcast the changed values of variables at all until the synchronization accesses take place. In a distributed system based on a network instead of a bus, this can dramatically reduce the amount of communication needed. Notice that nobody would deliberately write a program that behaved like this in practice; you'd never want to read variables that somebody else is updating, so the only reads would be after the S. I've mentioned in lecture that there are a few parallel algorithms, such as relaxation algorithms, that don't require normal notions of memory consistency. These algorithms wouldn't work in a weakly consistent system that really deferred all data communications until sync points.
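To make the "defer everything until S" strategy concrete, here is a toy, single-threaded simulation; it is a sketch under stated assumptions, not a description of any real system. Each processor keeps a private copy of x, writes touch only that copy, and a global S publishes the written value to everyone. The names `WeakMem`, `wm_write`, `wm_read`, and `wm_sync` are hypothetical. This models only one behavior that weak consistency allows: the scenario above also permits values to leak out before the S, which this simulation never does.

```c
#include <assert.h>
#include <stdbool.h>

#define NPROCS 3

typedef struct {
    int  local[NPROCS];   /* each processor's private copy of x */
    bool dirty[NPROCS];   /* has this processor written x since the last S? */
} WeakMem;

/* W(x)value on processor p: only the local copy changes. */
static void wm_write(WeakMem *m, int p, int value) {
    assert(p >= 0 && p < NPROCS);
    m->local[p] = value;
    m->dirty[p] = true;
}

/* R(x) on processor p: returns whatever the local copy holds. */
static int wm_read(const WeakMem *m, int p) {
    assert(p >= 0 && p < NPROCS);
    return m->local[p];
}

/* S: publish the pending write, if any, to every processor. If several
   processors wrote since the last S the program has a data race; this
   sketch arbitrarily keeps the highest-numbered writer's value. */
static void wm_sync(WeakMem *m) {
    bool any = false;
    int latest = 0;
    for (int p = 0; p < NPROCS; p++) {
        if (m->dirty[p]) { latest = m->local[p]; any = true; }
        m->dirty[p] = false;
    }
    if (!any) return;      /* nothing was written, nothing to send */
    for (int p = 0; p < NPROCS; p++)
        m->local[p] = latest;
}
```

Replaying P1's two writes followed by an S sends a single value (2) to the other processors instead of two, which is where the saving in communication comes from.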

## Release Consistency

Having the single synchronization access type requires that, when a synchronization occurs, we need to globally update memory - our local changes need to be propagated to all the other processors with copies of the shared variable, and we need to obtain their changes. Release consistency considers locks on areas of memory, and propagates only the locked memory as needed. It's defined as follows:

1. Before an ordinary access to a shared variable is performed, all previous acquires done by the process must have completed successfully.
2. Before a release is allowed to be performed, all previous reads and writes done by the process must have completed.
3. The acquire and release accesses must be sequentially consistent.
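C11 atomics expose acquire and release as explicit memory orderings, which gives a convenient way to sketch rules 1 and 2. The example below is an illustration under assumptions, not the notes' own code: it builds a spinlock from `atomic_flag`, and the names `x_lock`, `acquire`, `release`, `add_to_x`, and `in_critical` are hypothetical. Note that C11's acquire and release orderings do not by themselves make the synchronization accesses sequentially consistent with one another (rule 3); using `memory_order_seq_cst` for them would be the conservative choice if that property is needed.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_flag x_lock = ATOMIC_FLAG_INIT;   /* clear means unlocked */
static int shared_x = 0;      /* ordinary (non-atomic) shared variable */
static int in_critical = 0;   /* sanity counter, only touched under the lock */

/* Acquire: ordinary accesses that follow cannot be moved before it (rule 1). */
static void acquire(void) {
    while (atomic_flag_test_and_set_explicit(&x_lock, memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

/* Release: ordinary accesses that precede it must complete first (rule 2). */
static void release(void) {
    atomic_flag_clear_explicit(&x_lock, memory_order_release);
}

/* x = x + amount, with the read and the write bracketed by the lock. */
static void add_to_x(int amount) {
    acquire();
    in_critical++;
    assert(in_critical == 1);        /* mutual exclusion sanity check */
    shared_x = shared_x + amount;    /* ordinary accesses, no annotation */
    in_critical--;
    release();
}
```

If P1 calls `add_to_x(1)` and P2 calls `add_to_x(2)`, the final value of x is 3 regardless of which process gets the lock first, and other processors only ever need the value that is current at a release.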

## One Last Point

It should be pretty clear that a sync access is a pretty heavyweight operation, since it requires globally synchronizing memory. But the strength of these memory models is that the cost of a sync operation isn't any worse than the cost of every memory access in a sequentially consistent system.