Page Fault Handling in the Linux 1.0 Kernel

Here is a brief tour of where the code is and what it does.

Some Background

These notes are based on an absolutely antique version of the Linux kernel -- version 1.0. They aren't intended to give the reader any insights into modern Linux page-handling; I don't know how much of the information here is even still correct.

Speaking for myself, I find that early versions of code are normally simpler and easier to understand than later versions. This is here to give a concrete example of how page faults can be handled in a real OS.

About the Code

There are several places where apparently-redundant checks are made. There is a comment in the code to the effect that this is necessary due to race conditions; I'm afraid I don't know the code well enough to be able to say when there's a real race condition being avoided vs. just some code that really is redundant.

Page Faults in IA32

A page fault in the Intel architecture raises interrupt 14. The processor also puts the virtual address that faulted into a control register named cr2, and pushes an "error code" on the stack.

Not everything in the error code matters to us here. The two things that do are bit 0, which tells us whether we faulted on a page that's present, and bit 2, which tells us whether we were running in user or kernel mode when the interrupt happened.
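As a rough illustration of the two bits described above, here is a minimal C sketch that decodes a page-fault error code. The macro and function names are my own for this example, not the kernel's; bit 1 (set when the faulting access was a write) is listed only for completeness.

```c
#include <stdbool.h>

/* Hypothetical names for this sketch; the bit positions are those of the
   IA32 page-fault error code. */
#define PF_PRESENT 0x1u  /* bit 0: set = page was present (protection fault) */
#define PF_WRITE   0x2u  /* bit 1: set = the faulting access was a write */
#define PF_USER    0x4u  /* bit 2: set = the CPU was in user mode */

/* True when the fault hit a page that is present in memory. */
static bool fault_on_present_page(unsigned long error_code)
{
    return (error_code & PF_PRESENT) != 0;
}

/* True when the fault happened while running in user mode. */
static bool fault_from_user_mode(unsigned long error_code)
{
    return (error_code & PF_USER) != 0;
}
```

An error code of 4, for instance, would mean a user-mode access to a page that is not present.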

Major and Minor Faults

A page fault can be either major or minor. They are distinguished, not by the type of fault, but by whether or not the process that caused the fault is suspended due to the time required to handle the fault; this in turn is determined by whether IO is required to bring the data in from the disk.

I'm still looking for a definitive statement regarding the conditions under which a fault might be minor. The only example of a situation causing a minor fault that I've come across so far is when a COW page is accessed; making a copy of the page doesn't require any disk activity.

The Bottom-Level Interrupt Handler

Interrupt vectors in the IA32 architecture are referred to as "trap gates". There's actually a lot of information in a trap gate beyond just where to branch to in the event of an interrupt (more segmented memory model stuff), but that's the part we're interested in here.

All the trap gates (including the page fault trap gate) are set in .../linux/kernel/traps.c, in a procedure called trap_init(); the page fault gate is set on line 215. This can be done in C code, since all we're doing is setting up data structures.
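Since setting a gate is only a matter of filling in a data structure, the idea can be sketched in plain C. The following is a simplified illustration of packing an IA32 trap gate descriptor. The struct and function names are assumptions made for this example and are not the kernel's own code, and the handler address and segment selector are passed in as plain numbers.

```c
#include <stdint.h>

/* One 8-byte entry of the interrupt descriptor table (sketch). */
struct gate_desc {
    uint16_t offset_low;   /* handler address, bits 0..15 */
    uint16_t selector;     /* code segment selector for the handler */
    uint8_t  reserved;     /* unused, must be zero */
    uint8_t  type_attr;    /* present bit, privilege level, gate type */
    uint16_t offset_high;  /* handler address, bits 16..31 */
};

/* 0x80 = present, 0x0F = 32-bit trap gate, privilege level 0. */
#define TRAP_GATE_ATTR 0x8Fu

/* Hypothetical helper: fill in the gate for one interrupt vector. */
static void fill_trap_gate(struct gate_desc *idt, unsigned vector,
                           uint32_t handler, uint16_t selector)
{
    idt[vector].offset_low  = (uint16_t)(handler & 0xFFFFu);
    idt[vector].selector    = selector;
    idt[vector].reserved    = 0;
    idt[vector].type_attr   = TRAP_GATE_ATTR;
    idt[vector].offset_high = (uint16_t)(handler >> 16);
}
```

In this sketch, filling in vector 14 would correspond to setting the page fault gate. Note how the handler address is split into two halves; that is part of the extra information a gate carries.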

The actual code for _page_fault is located in .../linux/kernel/sys_call.S. This has to be done in assembly code, since it needs direct access to things like the registers and the stack. _page_fault is the last function in the file; it pushes the address of a C routine called do_page_fault() on the stack, and jumps to code that will extract the error code from the stack.

When we jump to error_code, we start by pushing all the registers on the stack (this preserves them for the process when we come back; our C code is going to corrupt them all). Next comes a bunch of pushes that set up the procedure call linkage for our C interrupt handler. The address of the handler ends up in the ebx register; we make an indirect procedure call through that register to land in do_page_fault().
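Pushing a handler address and later calling through a register has the same effect as calling through a function pointer in C. Here is a toy sketch of that idea. Every name in it is invented for the illustration, and the register block is a stand-in; the real entry code is assembly and its saved-register layout differs.

```c
/* Stand-in for the saved register block; the kernel's layout differs. */
struct saved_regs {
    unsigned long eflags;
    unsigned long esp;
};

typedef void (*fault_handler_t)(struct saved_regs *regs,
                                unsigned long error_code);

static unsigned long last_error_code;  /* records what the handler saw */

/* Plays the role of a C handler such as do_page_fault(). */
static void demo_handler(struct saved_regs *regs, unsigned long error_code)
{
    (void)regs;
    last_error_code = error_code;
}

/* Plays the role of the entry code: the handler address it is given
   selects the C function to run. */
static void common_entry(fault_handler_t handler, struct saved_regs *regs,
                         unsigned long error_code)
{
    handler(regs, error_code);  /* indirect call, like calling through ebx */
}
```

Calling common_entry(demo_handler, &regs, code) ends up in demo_handler() without common_entry() knowing which handler it was given.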

The C Page Fault Handler

The C code is found in .../linux/mm/memory.c at line 886. Now's where we want to slow down a bit and figure out what's happening.

First, a little bit of inline assembly code. gcc has both the most powerful and most obscure inline assembly code facilities I've ever seen; essentially, rather than just specifying the actual assembly code, it defines an instruction template describing how the instruction is used, and lets the compiler figure out how to get the data into the form needed for the instruction. What this instruction does is to take the contents of the cr2 register and put them in a variable called address which the compiler is able to use in C code.
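To make the template idea concrete, here is a small user-space sketch in the same style. It is an assumption-laden stand-in: reading cr2 is a privileged operation that only works inside the kernel, so this example copies an ordinary variable with a plain mov instead, and the function name is made up. On non-x86 machines it falls back to plain C so that it still builds.

```c
/* Copy src into a new variable using a single inline instruction. */
static unsigned long copy_via_asm(unsigned long src)
{
    unsigned long dst;
#if defined(__i386__) || defined(__x86_64__)
    /* "=r"(dst): the output lives in any general register gcc chooses.
       "r"(src):  the input is likewise placed in a register by gcc.
       %0 and %1 in the template are replaced by the chosen registers. */
    __asm__("mov %1,%0" : "=r"(dst) : "r"(src));
#else
    dst = src;  /* fallback for other architectures */
#endif
    return dst;
}
```

The kernel's instruction has the same overall shape, with cr2 as the source and address as the output operand.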

Now we ask whether the faulting address is in user space or kernel space (TASK_SIZE is defined in .../linux/include/linux/sched.h to be 0xc0000000 -- the location where the kernel starts). If the address is within the task space and we were in user mode, we check to see if we're in virtual-8086 mode (that's the check involving regs->eflags); if not, we check to see whether we had a write-protection error or a missing page.
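The overall shape of this decision can be sketched as follows. This is a simplification under stated assumptions, not the kernel's code: the names are hypothetical, and the user-mode and virtual-8086 checks are left out so that only the address test and the present-bit test remain.

```c
/* Values assumed for this sketch. */
#define SKETCH_TASK_SIZE 0xc0000000UL  /* where the kernel starts, per the text */
#define ERR_PRESENT      0x1UL         /* error code bit 0 */

enum fault_kind {
    FAULT_KERNEL_ADDRESS,  /* address at or above TASK_SIZE */
    FAULT_WRITE_PROTECT,   /* page present: the do_wp_page() case */
    FAULT_MISSING_PAGE     /* page absent: the do_no_page() case */
};

/* Decide which kind of fault we have from the address and error code. */
static enum fault_kind classify_fault(unsigned long address,
                                      unsigned long error_code)
{
    if (address >= SKETCH_TASK_SIZE)
        return FAULT_KERNEL_ADDRESS;
    if (error_code & ERR_PRESENT)
        return FAULT_WRITE_PROTECT;
    return FAULT_MISSING_PAGE;
}
```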

Fault on Write-Protect

This is the simpler case, so I want to talk about it first. We go to do_wp_page() to figure out what just went wrong. The first thing we do is get the page directory entry for the faulting address; if there is one, we go on and get the page table entry.
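Finding the directory entry and then the page table entry comes down to slicing up the faulting address. With 4 KB pages, IA32 uses the top 10 bits as the directory index, the next 10 bits as the page table index, and the low 12 bits as the offset within the page. A minimal sketch, with function names that are mine rather than the kernel's:

```c
/* Index into the page directory: the top 10 bits of the address. */
static unsigned dir_index(unsigned long address)
{
    return (unsigned)((address >> 22) & 0x3FFu);
}

/* Index into the page table: the next 10 bits. */
static unsigned table_index(unsigned long address)
{
    return (unsigned)((address >> 12) & 0x3FFu);
}

/* Byte offset within the 4 KB page: the low 12 bits. */
static unsigned page_offset(unsigned long address)
{
    return (unsigned)(address & 0xFFFu);
}
```

For the address 0x08049abc, this gives directory index 32, page table index 0x49, and offset 0xabc.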

In the page table entry, we check to see if the page either isn't present or shouldn't have had a write-protect error; in either of these cases we just return.

Now we know we either have a COW page or we have a real protection failure. If it's a protection failure we modify the process control block with information about the failure, and deliver a SIGSEGV to the process.
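The three outcomes described so far can be sketched as a small decision function. The flag names and the COW marker value are hypothetical choices for this example, and I am assuming that "shouldn't have had a write-protect error" means the entry is already marked writable.

```c
/* Hypothetical flag values for a page table entry in this sketch. */
#define PTE_PRESENT  0x001UL
#define PTE_WRITABLE 0x002UL
#define PTE_COW      0x200UL  /* software-defined copy-on-write marker */

enum wp_action {
    WP_NOTHING_TO_DO,  /* page absent or already writable: just return */
    WP_SEND_SIGSEGV,   /* a real protection failure */
    WP_COPY_PAGE       /* a COW page */
};

/* Decide what to do about a write-protect fault on this entry. */
static enum wp_action decide_wp_action(unsigned long pte)
{
    if (!(pte & PTE_PRESENT) || (pte & PTE_WRITABLE))
        return WP_NOTHING_TO_DO;
    if (!(pte & PTE_COW))
        return WP_SEND_SIGSEGV;
    return WP_COPY_PAGE;
}
```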

If it's a COW page, we check to see

Fault on a Missing Page

In this case, we go to do_no_page(). This, in turn, calls get_empty_pgtable() to either get the page table for the faulting address if one already exists, or to allocate a new, empty page table for it if there isn't one. We'll go through how this works later.
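The "get it if it exists, otherwise allocate an empty one" pattern can be sketched in user-space C. This is only a toy model under my own assumptions: the directory holds ordinary pointers and uses calloc(), whereas the kernel works with physical page frames and entries that carry flag bits. All names here are invented for the example.

```c
#include <stdlib.h>

#define ENTRIES_PER_TABLE 1024

/* Toy page directory: each slot is NULL or points to a page table. */
struct toy_directory {
    unsigned long *tables[ENTRIES_PER_TABLE];
};

/* Return the page table covering address, allocating a zeroed one if the
   directory slot is empty. Returns NULL if the allocation fails. */
static unsigned long *toy_get_pgtable(struct toy_directory *dir,
                                      unsigned long address)
{
    unsigned slot = (unsigned)((address >> 22) & 0x3FFu);

    if (dir->tables[slot] == NULL)
        dir->tables[slot] = calloc(ENTRIES_PER_TABLE, sizeof(unsigned long));
    return dir->tables[slot];
}
```

A second call for any address in the same 4 MB region returns the table that the first call allocated.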

Now we look up the page table entry again; if the page turns out to be present we don't have anything else to do. If it's

Last modified: Thu Dec 23 10:47:15 MST 2010