Here is a brief tour of where the code is and what it does.
These notes are based on an absolutely antique version of the Linux kernel -- version 1.0. They aren't intended to give the reader any insights into modern Linux page-handling; I don't know how much of the information here is even still correct.
Speaking for myself, I find that early versions of code are normally simpler and easier to understand than later versions. This is here to give a concrete example of how page faults can be handled in a real OS.
There are several places where apparently-redundant checks are made. There is a comment in the code to the effect that this is necessary due to race conditions; I'm afraid I don't know the code well enough to be able to say when there's a real race condition being avoided vs. just some code that really is redundant.
A page fault in the Intel architecture raises interrupt 14. The processor also puts the virtual address that faulted into a system control register named cr2, and pushes an "error code" on the stack.
There's a lot in the error code we don't care about (in particular, it relates to the IA32 segmented memory model); the two things that do matter to us are bit 0, which is set if we faulted on a page that's present, and bit 2, which is set if we were running in user mode (rather than kernel mode) when the interrupt happened.
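As a minimal sketch of how those two bits can be tested in C: the macro and function names below are made up for illustration and are not the kernel's own identifiers; only the bit positions come from the description above.

```c
#include <stdbool.h>

/* Bit masks for the IA32 page-fault error code.
 * The names are hypothetical; only the bit positions are from the text. */
#define PF_ERR_PRESENT 0x1UL /* bit 0: set = page was present, clear = page missing */
#define PF_ERR_USER    0x4UL /* bit 2: set = fault taken in user mode, clear = kernel mode */

/* True if the fault was on a page that is present. */
bool pf_page_was_present(unsigned long error_code)
{
    return (error_code & PF_ERR_PRESENT) != 0;
}

/* True if the processor was in user mode when the fault happened. */
bool pf_from_user_mode(unsigned long error_code)
{
    return (error_code & PF_ERR_USER) != 0;
}
```

For example, an error code of 4 would describe a user-mode access to a page that isn't present.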
A page fault can be either major or minor. The two are distinguished, not by the type of fault, but by whether or not the process that caused the fault is suspended due to the time required to handle the fault; this in turn is determined by whether IO is required to bring the data in from the disk.
I'm still looking for a definitive statement regarding the conditions under which a fault might be minor. The only example of a situation causing a minor fault that I've come across so far is when a COW page is accessed; making a copy of the page doesn't require any disk activity.
Interrupt vectors in the IA32 architecture are referred to as "trap gates". There's actually a lot of information in a trap gate beyond just where to branch to in the event of an interrupt (more segmented memory model stuff), but that's the part we're interested in here.
All the trap gates (including the page fault trap gate) are set in a single procedure; the page fault gate is set on line 215. This can be done in C code, since all we're doing is setting up data structures.
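To give a feel for what "setting up data structures" means here, this is a rough sketch of packing one 8-byte IA32 trap gate descriptor in portable C. The struct and function names are hypothetical, and this is not the kernel's own code; only the bit layout follows the IA32 gate descriptor format.

```c
#include <stdint.h>

/* One 8-byte IA32 gate descriptor, held as two 32-bit words.
 * Struct and function names are hypothetical, for illustration only. */
struct gate_desc {
    uint32_t lo; /* segment selector in bits 31..16, handler offset bits 15..0 */
    uint32_t hi; /* handler offset bits 31..16, then present/DPL/type bits */
};

#define GATE_TYPE_TRAP32 0xFu /* 32-bit trap gate type code */

struct gate_desc make_trap_gate(uint32_t handler, uint16_t selector, unsigned dpl)
{
    struct gate_desc g;

    g.lo = ((uint32_t)selector << 16) | (handler & 0xFFFFu);
    g.hi = (handler & 0xFFFF0000u)
         | 0x8000u                      /* bit 15: descriptor is present */
         | ((uint32_t)(dpl & 3u) << 13) /* bits 14..13: privilege level */
         | (GATE_TYPE_TRAP32 << 8);     /* bits 11..8: gate type */
    return g;
}
```

The interrupt table is then just an array of these descriptors, with the page fault gate stored at index 14.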
The actual code for _page_fault is located in .../linux/kernel/sys_call.S. This has to be done in assembly code, since it needs direct access to things like the registers and the stack. _page_fault is the last function in the file; it pushes the address of a C routine called do_page_fault() on the stack, and jumps to code that will extract the error code from the stack.
When we jump to error_code, we start by pushing all the registers on the stack (this will preserve them for the process when we come back; our C code is going to corrupt them all). A bunch of further pushes then set up the procedure call linkage for our C interrupt handler. The address of the handler ends up in the ebx register; we make an indirect procedure call through that register to land in do_page_fault().
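The "one shared entry routine, handler chosen by address" pattern can be sketched in C with a function pointer. Everything here is a simplified stand-in: the register frame has only two fields, and the names common_entry and demo_page_fault_handler are invented for the example.

```c
/* Simplified stand-in for the saved register frame (the real one has more fields). */
struct saved_regs {
    unsigned long eflags;
    unsigned long esp;
};

/* Every handler takes the saved registers and the error code. */
typedef void (*fault_handler_t)(struct saved_regs *regs, unsigned long error_code);

static unsigned long last_error_code; /* records what the demo handler was given */

/* Hypothetical handler: just remembers the error code it received. */
void demo_page_fault_handler(struct saved_regs *regs, unsigned long error_code)
{
    (void)regs;
    last_error_code = error_code;
}

/* Plays the role of the shared entry code: it does not know which handler
 * it was given, and reaches it with an indirect call. */
void common_entry(fault_handler_t handler, struct saved_regs *regs,
                  unsigned long error_code)
{
    handler(regs, error_code); /* comparable to the call through ebx */
}
```

The point of the design is that each fault type only needs a tiny stub naming its handler; the register saving and call setup are written once.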
The C code is found in .../linux/mm/memory.c at line 886. Now's where we want to slow down a bit and figure out what's happening.
First, a little bit of inline assembly code. gcc has both the most powerful and most obscure inline assembly code facilities I've ever seen; essentially, rather than just specifying the actual assembly code, you write an instruction template describing how the instruction is used, and let the compiler figure out how to get the data into the form needed for the instruction. What this instruction does is take the contents of the cr2 register and put them in a variable called address, which the compiler is able to use in C code.
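Reading cr2 is a privileged operation, so it can't be tried out in an ordinary user program. As a rough user-mode analogue of the same template mechanism, the sketch below copies one C variable into another through a register chosen by the compiler. The function name is made up, and the fallback branch is only there so the example still builds on non-x86 machines.

```c
/* Copy a value using a gcc instruction template.
 * Hypothetical helper; a user-mode stand-in for a privileged register read. */
unsigned long copy_via_asm(unsigned long value)
{
    unsigned long result;

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* %0 is the output operand and %1 the input operand. The "r"
     * constraints tell the compiler to pick registers for both, and
     * "=r" marks the output as write-only. */
    __asm__("mov %1, %0" : "=r"(result) : "r"(value));
#else
    /* Not an x86 target: do the same thing in plain C. */
    result = value;
#endif
    return result;
}
```

In the kernel's case the source operand is the cr2 register rather than a C variable, and the output operand is the variable address.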
Now we ask whether the faulting address is in user space or kernel space (the boundary is TASK_SIZE, which is defined as 0xc0000000 -- the location where the kernel starts). If we are within the task space and we are in user mode, we check to see if we're in virtual-8086 mode (that's the check involving regs->eflags), and if not we check to see if we had a write-protection error vs. a missing page.
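The core of that decision can be sketched as a small classifier. This is a simplification under stated assumptions: it leaves out the user-mode and virtual-8086 checks, the enum and macro names are invented, and it is not the body of do_page_fault() itself.

```c
#define SKETCH_TASK_SIZE 0xc0000000UL /* value given in the text */
#define ERR_PRESENT      0x1UL        /* bit 0 of the error code */

/* Hypothetical labels for the three outcomes described above. */
enum fault_kind {
    FAULT_KERNEL_ADDRESS, /* address is at or above TASK_SIZE */
    FAULT_WRITE_PROTECT,  /* page present: goes to do_wp_page() */
    FAULT_MISSING_PAGE    /* page absent: goes to do_no_page() */
};

enum fault_kind classify_fault(unsigned long address, unsigned long error_code)
{
    if (address >= SKETCH_TASK_SIZE)
        return FAULT_KERNEL_ADDRESS;
    if (error_code & ERR_PRESENT)
        return FAULT_WRITE_PROTECT;
    return FAULT_MISSING_PAGE;
}
```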
The write-protection error is the simpler case, so I want to talk about it first. We go to do_wp_page() to figure out what just went wrong. The first thing we do is get the page directory entry for the faulting address; if there is one, we go on and get the page table entry.
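Finding those two entries comes down to slicing up the faulting address. With IA32 two-level paging and 4 KB pages, the top 10 bits index the page directory, the next 10 bits index the page table, and the low 12 bits are the offset within the page. A minimal sketch, with struct and function names invented for the example:

```c
#include <stdint.h>

/* The three fields of a 32-bit IA32 virtual address (4 KB pages).
 * Hypothetical type, for illustration only. */
struct va_parts {
    uint32_t dir_index;   /* bits 31..22: index into the page directory */
    uint32_t table_index; /* bits 21..12: index into the page table */
    uint32_t offset;      /* bits 11..0: byte offset within the page */
};

struct va_parts split_virtual_address(uint32_t address)
{
    struct va_parts p;

    p.dir_index   = address >> 22;
    p.table_index = (address >> 12) & 0x3FFu;
    p.offset      = address & 0xFFFu;
    return p;
}
```

For instance, the address 0x08049abc splits into directory index 32, table index 0x49, and offset 0xabc.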
In the page table entry, we check to see if the page either isn't present or shouldn't have had a write-protect error; in either of these cases we just return.
Now we know we either have a COW page or we have a real protection failure. If it's a protection failure we modify the process control block with information about the failure, and deliver a SIGSEGV to the process.
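The sequence of checks on the page table entry can be sketched as below. The present and read/write bit positions are the standard IA32 ones; the COW marker is an assumption made for this sketch (a software-defined bit), and the enum and function names are invented. It only decides what to do; it doesn't copy pages or deliver signals.

```c
#include <stdint.h>

#define PTE_PRESENT 0x001u /* IA32 hardware bit 0: page is present */
#define PTE_RW      0x002u /* IA32 hardware bit 1: page is writable */
#define PTE_COW     0x200u /* ASSUMPTION: software bit marking a COW page */

/* Hypothetical outcomes of the write-protect checks. */
enum wp_action {
    WP_NOTHING,    /* not present, or already writable: just return */
    WP_HANDLE_COW, /* copy-on-write page */
    WP_SIGSEGV     /* real protection failure */
};

enum wp_action wp_decide(uint32_t pte)
{
    if (!(pte & PTE_PRESENT))
        return WP_NOTHING; /* page isn't present */
    if (pte & PTE_RW)
        return WP_NOTHING; /* shouldn't have had a write-protect error */
    if (pte & PTE_COW)
        return WP_HANDLE_COW;
    return WP_SIGSEGV;
}
```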
If it's a COW page, we check to see
In the missing-page case, we go to do_no_page(). This, in turn, calls get_empty_pgtable() to either get the page table for the faulting address if one already exists, or to allocate a new,
empty page table for it if there isn't one. We'll go through how this
Now we look up the page table entry again; if the page turns out to be present we don't have anything else to do. If it's