The main purpose of a virtual memory system is to allow the software to manage the memory resource more efficiently. It is not to make memory look bigger: systems have been built that actually had a larger physical memory than virtual memory space! This is the case with 32-bit Intel processors with Physical Address Extension, which allows a 64 GB physical memory on a machine with only a 4 GB virtual space. The goal of a page replacement strategy is to keep the pages that are being used a lot in physical memory, and leave what isn't being used a lot out on disk.
So... we need to keep track of several things to do an effective job of VM management:
Virtual memory regions in use by each process. Each process will have areas marked that they can access, and information regarding what they can do with those marked regions. A page in the process's virtual memory space may be mapped to a page frame in physical memory at the moment, or it may be saved out to the swap area (also sometimes called the backing store) on disk.
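When a process tries to access (either read or write) an address in its virtual space, three things can happen: the page is mapped and the access is permitted, so the access simply proceeds; the page is mapped but the access violates the permissions on that region; or the page is not currently mapped to a page frame, and the hardware raises a page fault.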
Of these possibilities, the first is by far the most common. The second should be extremely rare; it will normally only happen due to a bug in a program. The third may happen due to a program bug (trying to access memory that isn't in a process's memory region), or because the page is out in the swap space. This is the interesting case.
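The three outcomes can be modelled with a toy page table lookup. This is only a sketch: the dictionary layout, the 4 KB page size, and the names `Outcome` and `classify_access` are all assumptions made for illustration, and the second case is assumed to be a permission violation.

```python
from enum import Enum

PAGE_SIZE = 4096  # assumed page size, for illustration only


class Outcome(Enum):
    OK = "access proceeds"
    PROTECTION_FAULT = "permission violation"
    PAGE_FAULT = "page fault"


def classify_access(page_table, address, is_write):
    """Classify one access.

    page_table is a hypothetical dict mapping a virtual page number to
    {"present": bool, "writable": bool}.
    """
    vpn = address // PAGE_SIZE
    entry = page_table.get(vpn)
    if entry is None or not entry["present"]:
        # Either outside the process's regions (a bug) or out on swap.
        return Outcome.PAGE_FAULT
    if is_write and not entry["writable"]:
        return Outcome.PROTECTION_FAULT
    return Outcome.OK
```

In a real system the first check is done by the MMU on every access; only the two fault cases ever reach the OS.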
In addition to these per-process data structures, the OS also needs to keep global track of the state of the page frames. At its coarsest, it has to keep track of which page frames have pages mapped to them, and which are free.
When the OS first boots, very few of the page frames will be in use. Basically, just the ones holding the OS itself. As new processes are started and executed, the number of free page frames goes down; when processes die, their page frames are released and become free again.
If a process has a page fault due to some of its space being out on swap, we need to allocate a page frame for it, copy the data from backing store to the newly allocated page frame, set up a VM mapping for it, and reexecute the instruction that faulted.
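The four steps can be sketched as a small simulation. Everything here is hypothetical: `swap`, `memory`, `page_table` and `free_frames` are plain Python containers standing in for the real structures, and the sketch only covers the case where a free frame exists.

```python
class SegmentationFault(Exception):
    """Raised when the page is not part of the process's regions at all."""


def handle_page_fault(vpn, page_table, free_frames, swap, memory):
    """Service a fault on virtual page `vpn`, assuming a free frame exists."""
    if vpn not in swap:
        raise SegmentationFault(f"page {vpn} is not in any mapped region")
    if not free_frames:
        raise RuntimeError("no free frame: a replacement policy must pick a victim")
    frame = free_frames.pop()      # 1. allocate a page frame
    memory[frame] = swap[vpn]      # 2. copy the data in from backing store
    page_table[vpn] = frame        # 3. set up the mapping
    return frame                   # 4. the caller retries the access


def read(vpn, page_table, free_frames, swap, memory):
    """Read a page, faulting it in first if it is not mapped."""
    if vpn not in page_table:
        handle_page_fault(vpn, page_table, free_frames, swap, memory)
    return memory[page_table[vpn]]  # the re-executed access now succeeds
```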
Once again, things can be easy or they can be difficult. The easy case is when we have some free page frames we can allocate to the process. In that case, we can just pick one at random and give it. But this doesn't happen very often -- normally, all the page frames have been allocated to processes, and we have to take a frame away from another process to give it to the process that faulted. As you can imagine, if we make bad choices then we'll have a lot of faults; handling a page fault is such a time-consuming process that in this case virtually no work gets done. This condition is called "thrashing."
There are many page replacement strategies in use, with various strengths and weaknesses.
The good news is, there is a provably optimal page replacement
strategy. The bad news is, it can't be implemented (in general).
Belady proved that the number of page faults is minimized if, on each
page fault, we preempt the page that will not be accessed for the
longest time into the future. While it's pretty easy to see that this is
impossible in general, it's worthwhile to show the argument: if it were
possible, then we could write a function that tells us which page that
is. We'll call the function longest_time(), and assume that it
returns the first address on the page. Now all we do is write a
little program that calls longest_time(), and then
immediately accesses that address. So we can always write a program
that defeats the function (this is the same basic argument that shows
that the ``halting problem'' is unsolvable).
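Although the OS cannot run this strategy online, it can be evaluated after the fact on a recorded reference string, which makes it a useful yardstick. A minimal sketch, assuming the whole trace is known in advance (`opt_faults` is a name made up for this example):

```python
def opt_faults(refs, num_frames):
    """Count page faults for Belady's strategy on a fully known trace."""
    frames = set()
    faults = 0
    for i, page in enumerate(refs):
        if page in frames:
            continue
        faults += 1
        if len(frames) < num_frames:
            frames.add(page)
            continue

        def next_use(p):
            # Position of the next reference to p, or infinity if none.
            try:
                return refs.index(p, i + 1)
            except ValueError:
                return float("inf")

        # Evict the resident page whose next use is furthest away.
        frames.remove(max(frames, key=next_use))
        frames.add(page)
    return faults
```

On the reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 with three frames, this gives 7 faults.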
Notice, though, that just because we can't always find the optimal page to replace doesn't mean we can't ever find it. In particular, a given process may have a pretty good idea of its own access patterns, and be able to make an intelligent choice as to which page it should give up. There have been a very few operating systems that allowed a process to do this; Mach may be the best known (I don't know whether Mach, as used in Macintosh OS X, retains this ability). BSD Unix had a call that was supposed to allow a process to announce what its access patterns would be so the OS could take that into account in making its decisions, but it wound up actually being ignored.
In what follows, we'll be pursuing the general case, in which the OS has to make decisions on what page to replace based solely on history.
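What we'll do is take advantage of a phenomenon called ``locality.'' There are two forms of locality: temporal locality (a location that was accessed recently is likely to be accessed again soon) and spatial locality (locations near a recently accessed one are likely to be accessed soon).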
We can see this pretty easily if we think about how a program executes. After any given instruction, the single instruction you are most likely to access next is the next instruction in the code stream (spatial locality). Any program long enough to be interesting contains either loops or recursion (temporal locality). We can see it really clearly in the case of activation records on the stack: the further down the stack an activation record is, the longer it's going to be before we access it. In a language like C this is an absolute; in a language like Pascal (which supports nesting of procedures) it's remotely possible to access activations between the current one and the program globals. But in practice it turned out that this was almost never done.
This tells us that we can make the following guess: the pages that have not been accessed for the longest time in the past are likely to be the ones that we will not access for the longest time into the future. So this gives us our first algorithm approximating Belady's Algorithm: Least Recently Used. We will select the Least Recently Used page as the one to toss. Unfortunately, we can't implement this algorithm either: to do it, we'd need to have some way, in hardware, of keeping track of the order of page references. This isn't practical.
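In a simulation, where the whole reference string is handed to us, true LRU is easy to compute. The obstacle described above is tracking the order on every memory reference of a live system, not the policy itself. A sketch with a made-up function name:

```python
from collections import OrderedDict


def lru_faults(refs, num_frames):
    """Count page faults for exact LRU on a recorded reference string."""
    frames = OrderedDict()  # least recently used page comes first
    faults = 0
    for page in refs:
        if page in frames:
            frames.move_to_end(page)    # page is now the most recently used
            continue
        faults += 1
        if len(frames) >= num_frames:
            frames.popitem(last=False)  # toss the least recently used page
        frames[page] = True
    return faults
```

On the string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 this gives 10 faults with three frames and 8 with four.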
So, how can we approximate LRU? (Notice that we've now gotten to an approximation of an approximation -- not real comforting, but the best we've got.) Here's a first idea: something that we can keep track of in software is when a page was first brought into memory! We can use a scheme called first in, first out: the page that has been in memory the longest is the one we select to take out. While easy to implement, this scheme turns out to be a disaster: there's almost no correlation between when a page was brought into memory and when it was last accessed. Oops.
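First in, first out really is trivial to write down, which is most of its appeal. A sketch for counting its faults on a recorded reference string (the function name is made up for this example):

```python
from collections import deque


def fifo_faults(refs, num_frames):
    """Count page faults for first in, first out replacement."""
    queue = deque()   # pages in the order they were brought in
    resident = set()
    faults = 0
    for page in refs:
        if page in resident:
            continue  # a hit does not change the queue order
        faults += 1
        if len(queue) >= num_frames:
            resident.discard(queue.popleft())  # evict the oldest arrival
        queue.append(page)
        resident.add(page)
    return faults
```

On the string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 this gives 9 faults with three frames and 10 with four: giving the process more memory made things worse (the effect usually called Belady's anomaly).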
What does work is approximating LRU with a little help from the hardware. The simplest algorithm in this class is called the Clock algorithm. First, we assume that each page table entry has a Referenced bit, which is set to 1 by the hardware whenever a page is accessed, but which can be cleared by software.
The way the Clock algorithm works is to maintain a notion of a ``current page.'' Whenever a page fault occurs, we examine the current page's Referenced bit. If it is set we clear it, and move on to the next page. If it's clear, we have a victim. Eventually, we will find a victim (at worst, we'll wrap all the way around and select the page we started with). The pages that haven't been referenced since the clock hand last passed over them are, of course, not recently used.
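The sweep can be sketched as follows. The class and attribute names are made up for this example, and the Referenced bit that the hardware would set is simulated in `access`. The sketch also assumes the newly loaded page gets its bit set, since the faulting access is retried.

```python
class Clock:
    """Toy Clock replacement over a fixed number of page frames."""

    def __init__(self, num_frames):
        self.pages = [None] * num_frames        # page held by each frame
        self.referenced = [False] * num_frames  # simulated Referenced bits
        self.where = {}                         # page -> frame index
        self.hand = 0                           # the ``current page''

    def access(self, page):
        """Touch a page; return True if the access caused a page fault."""
        if page in self.where:
            self.referenced[self.where[page]] = True  # done by hardware
            return False
        # Page fault: clear set bits until we reach a frame whose bit is clear.
        while self.referenced[self.hand]:
            self.referenced[self.hand] = False
            self.hand = (self.hand + 1) % len(self.pages)
        victim = self.pages[self.hand]
        if victim is not None:
            del self.where[victim]
        self.pages[self.hand] = page
        self.where[page] = self.hand
        self.referenced[self.hand] = True  # the retried access touches it
        self.hand = (self.hand + 1) % len(self.pages)
        return True
```

Empty frames start with a clear bit, so they are handed out before any real page is evicted.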
There are a lot of variations on Clock; here are a couple.
We can have the clock tick at regular intervals, instead of whenever
there is a page fault. Whenever we find a 0 referenced
bit, we'll put the page at the tail of a "victim list" and mark it as
not present (even though it's really still in memory) in the page
table. When we reference the page we'll get a fault, but the page
will actually be in memory so the fault isn't very expensive; we'll just
mark it as present, take it off the victim list, and put it back in
the clock list. Since pages are taken back off the victim list
whenever they are accessed, the pages that remain on the victim list
tend to end up in a close approximation to LRU order.
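A rough sketch of the victim list idea follows. To keep it short, it assumes that every page is resident and that each tick sweeps all of the pages at once; a real clock would advance a page at a time. All names are hypothetical.

```python
from collections import OrderedDict


class VictimListPager:
    """Toy model of a periodic clock that feeds a victim list."""

    def __init__(self, pages):
        self.referenced = {p: False for p in pages}  # pages on the clock list
        self.victims = OrderedDict()                 # head = first to be evicted

    def access(self, page):
        if page in self.victims:
            # Cheap fault: the page is still in memory, so just rescue it.
            del self.victims[page]
            self.referenced[page] = True
            return "soft fault"
        self.referenced[page] = True
        return "hit"

    def tick(self):
        """One clock interval (simplified to a full sweep)."""
        for page in list(self.referenced):
            if self.referenced[page]:
                self.referenced[page] = False
            else:
                del self.referenced[page]   # marked not present
                self.victims[page] = True   # goes on the tail of the list

    def evict(self):
        """Take the page at the head of the victim list."""
        page, _ = self.victims.popitem(last=False)
        return page
```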
As the size of physical memory gets large, it takes longer and longer to make it all the way through all the pages with the clock algorithm. If it takes too long, we end up with virtually all of the pages having been touched between successive clock sweeps, pretty much no matter how frequently they're accessed. So, we can shorten the "fuse" on how long a page has to go unreferenced before being put on the victim list by using a two-handed clock (no, they aren't an hour hand and a minute hand!).
Instead of having a single clock hand, we can use one hand to clear reference bits and a second hand following along behind examining them, to decide who goes on the victim list. This is the algorithm used by Solaris.
Notice that two-handed clock can be used together with a victim list.
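A sketch of the two hands working together with a victim list. The hand spread is measured in pages and the names are made up for this example. It is not meant to reflect how Solaris actually implements it.

```python
class TwoHandClock:
    """Toy two-handed clock: the front hand clears, the back hand examines."""

    def __init__(self, num_pages, spread):
        if not 0 < spread < num_pages:
            raise ValueError("spread must be between 1 and num_pages - 1")
        self.referenced = [True] * num_pages
        self.front = spread   # the front hand leads by `spread` pages
        self.back = 0
        self.victims = []

    def touch(self, page):
        """Simulate the hardware setting the bit; rescue the page if listed."""
        self.referenced[page] = True
        if page in self.victims:
            self.victims.remove(page)

    def step(self):
        """Advance both hands by one page."""
        n = len(self.referenced)
        self.referenced[self.front] = False   # front hand clears the bit
        # The back hand checks a bit that was cleared `spread` steps ago.
        if not self.referenced[self.back] and self.back not in self.victims:
            self.victims.append(self.back)
        self.front = (self.front + 1) % n
        self.back = (self.back + 1) % n
```

A smaller spread gives a page less time to prove it is still in use, which is the shorter ``fuse'' described above.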
Question: is it more important to minimize global page faults (at the expense of one ill-behaved program causing poor behavior for all other programs), or to minimize page faults for each program individually (so everybody but the ill-behaved program avoids page faults, but the total disk activity is higher and disk I/O ends up having to queue up with page replacements)?
Most real paging algorithms try to compromise between local and global replacement. They do this by doing the following:
One particular implementation of this scheme is Denning's Working Set Strategy. He defines a process's working set at time t as the set of pages the process has referenced within the most recent interval of length delta, that is, between time t - delta and time t. We remove pages from a process's working set as they get too old. The text also mentions some other working set algorithms.
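The definition can be tried out on a recorded reference string. This sketch assumes time is measured in references, so the index into the list plays the role of t; the function name is made up for this example.

```python
def working_set(refs, t, delta):
    """Pages referenced in the last `delta` references up to and including t."""
    start = max(0, t - delta + 1)
    return set(refs[start:t + 1])
```

For the string 1, 2, 3, 1, 2, 4, 4, 4 and a delta of 3, the working set at t = 4 is {1, 2, 3}, and by t = 7 it has shrunk to {4}.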
Within each process, a Two-hand Clock is used to find targets for paging out. Two-hand Clock uses two pointers, called the front hand and the back hand. The front hand sweeps through the process's pages setting the referenced bit to 0. Some time later, the back hand sweeps through, checking the referenced bits. Any page whose referenced bit is still 0 is placed on a list to be swapped out (the sweep rate and hand spread are both configurable parameters). If a page that has been scheduled for paging out is referenced before it is actually written, it is rescued.
There are a number of heuristics that are applied: a process is not permitted to grab more pages than some threshold. A process is not allowed to be reduced to too small a number of pages. If the page fault rate is too high, the kernel may decide to take a whole process and swap it out (hopefully this only happens to idle processes...).
One other thing to notice about this is that it never decides that some pages are just plain too old and tosses them out. This in turn means that looking at the amount of memory currently marked as ``in use'' by the kernel is completely useless as a means of seeing if a Unix system has enough memory; if it doesn't say that nearly all memory is in use pretty much all the time, you've got 'way more memory than you need.
It would also be possible for a process to do all its own paging, particularly if we use either a local or a working set strategy; this could be implemented pretty straightforwardly (I don't see any particular security problems with this...). First, there's no particular reason a process can't have read access to its page table (write access would be something else again!). Given this, it would be able to look at its referenced and dirty bits to decide when a page can fall out of its working set.
The general approach would be that the process would have a system call, requesting the OS to map ``n'' pages starting at location ``m,'' along with a second system call to unmap the pages. Now all it takes is a ``you just page faulted'' signal and a ``you need to give me a page'' signal (enforcing the requirement that the process give up a page would be a little problematic, but not badly so).