Note: this information has all been distilled from Josh Aas's excellent, in-depth discussion of the scheduler. His paper is available at http://josh.trancesoftware.com/linux/; for the full details on how things really work, see that paper.
The Linux 2.6 scheduler is intended for use in an extremely broad range of environments with fundamentally incompatible goals.
The first challenge comes from the range of application domains in which Linux systems are used.
Linux was first developed as a desktop operating system (for Linus Torvalds's desktop!). However, its first successful market penetrations occurred in the server and high-performance computing markets: in effect, the traditional Unix markets. These three application domains have very different requirements for response time and throughput:
On a desktop, interactive response is by far the most important consideration. A user at a monitor wants to see immediate response to keystrokes and mouse clicks.
A server must balance requirements for interactivity with throughput. In general, throughput is a much more important consideration than interactivity, since a user will tolerate a delay of several seconds in tasks such as downloading a web page. All the same (as Aas points out), consider the download of two huge files from an FTP server: even if we could complete both downloads more quickly by serving one to completion and then the other, this would be regarded as unacceptable by users.
In a high-performance computing environment, single applications may run for days or weeks before producing outputs. In these environments, interactive response is completely unimportant in comparison to throughput.
The second challenge comes from the range of hardware environments in which the scheduler is deployed:
Most of this class discusses only traditional single-processor environments. But that's not all Linux is used in, and it's not even clear that this will remain the dominant environment for much longer....
SMT (simultaneous multithreading) is what Intel calls "hyperthreading". The idea is that there is only one set of functional units, but two virtual processors share them. In an SMT environment processes can migrate freely between the virtual CPUs, since everything up through L1 cache is shared. (At the moment, it doesn't look like SMT has worked very well; having the two virtual processors share cache has hurt performance more than running two instruction streams at once has helped it. But we'll see what the future holds.)
SMP (symmetric multiprocessing) refers to using two (or more) processors that share memory. Here, the caches are separate, so we don't want to move a process from one processor to another if we don't have to. Note that the multicore CPU chips coming out share L2 cache but not L1, so in some sense they are intermediate between SMT and SMP as defined here.
NUMA (Non-Uniform Memory Access) refers to multiple processors sharing memory, but with the access time to various parts of memory being different for the various processors. Here you really don't want to move processes around if you can help it, since you then have the problem of getting their memory to follow them.
Having said that, let's move on to the scheduler implementation. We're
only discussing SCHED_OTHER processes here, not the
"real-time" scheduling.
A Linux process has both a "static" and a "dynamic" priority. The static priority is the nice value as defined by a user; the lower the nice value the higher the priority. A normal user can set a nice value between 0 and 19; the superuser can set niceness as low as -20. The default niceness is 0.
The dynamic priority is calculated by modifying the static priority according to the process's behavior; a CPU hog will have a penalty applied, while an IO bound process will have a boost applied.
An IO bound process will spend much of its time sleeping, waiting for
resources other than the CPU. When a process is awakened from a
sleep, its total sleep time is added to a per-process variable called
sleep_avg. When a process gives up the CPU, either
voluntarily because it is waiting for a resource or involuntarily
because it got preempted, the time it spent running is subtracted from
sleep_avg. The maximum sleep_avg is bounded
by a MAX_SLEEP_AVG parameter.
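The bookkeeping just described can be sketched in a few lines. This is only an illustration, not the kernel's code: the Task class and its method names are hypothetical, the value chosen for MAX_SLEEP_AVG is made up, and the floor at zero is an assumption (the notes above only mention the upper bound).

```python
MAX_SLEEP_AVG = 1000  # assumed upper bound in milliseconds (illustrative value)


class Task:
    """Hypothetical stand-in for the per-process scheduler state."""

    def __init__(self, sleep_avg=0):
        self.sleep_avg = sleep_avg

    def on_wakeup(self, slept_ms):
        # Credit the time spent sleeping, capped at MAX_SLEEP_AVG.
        self.sleep_avg = min(self.sleep_avg + slept_ms, MAX_SLEEP_AVG)

    def on_deschedule(self, ran_ms):
        # Charge the time spent running, whether the task blocked or was
        # preempted. The floor at 0 is an assumption of this sketch.
        self.sleep_avg = max(self.sleep_avg - ran_ms, 0)


io_bound = Task()
io_bound.on_wakeup(800)      # long sleep waiting for a resource
io_bound.on_deschedule(50)   # short burst on the CPU
print(io_bound.sleep_avg)    # 750
```

A process that mostly sleeps drifts toward MAX_SLEEP_AVG, while a CPU hog drifts toward zero.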
The process's sleep_avg is mapped into a bonus value
between -5 and 5 using a simple linear formula.
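One plausible form of that linear mapping is sketched below. The constant values and the function names are assumptions made for illustration, and priorities are expressed on the nice scale (-20 to 19) to match the discussion above rather than on whatever internal scale the kernel uses.

```python
MAX_SLEEP_AVG = 1000  # assumed bound on sleep_avg (illustrative value)
MAX_BONUS = 10        # assumed width of the bonus range, giving -5..+5


def bonus(sleep_avg):
    # Scale sleep_avg linearly onto 0..MAX_BONUS, then centre it on zero.
    return sleep_avg * MAX_BONUS // MAX_SLEEP_AVG - MAX_BONUS // 2


def dynamic_priority(nice, sleep_avg):
    # A positive bonus lowers the number, which means a higher priority.
    # The result is clamped to the legal nice range.
    return max(-20, min(19, nice - bonus(sleep_avg)))


print(bonus(0), bonus(500), bonus(1000))   # -5 0 5
print(dynamic_priority(0, 1000))           # -5 (boosted sleeper)
print(dynamic_priority(0, 0))              # 5 (penalized CPU hog)
```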
The scheduler dynamically adjusts both a process's dynamic priority and the length of its timeslice. An interactive process is (contrary to my intuition!) given a longer timeslice than a compute-bound process; timeslice is calculated by mapping the dynamic priority to a timeslice value.
Another heuristic is that "timeslice" as used by the scheduler doesn't
quite mean the same thing as "timeslice" in our textbook. In
particular, there is a parameter called the
TIMESLICE_GRANULARITY. If a process has been running for
that long, and there are other processes with the same dynamic
priority, it will be preempted and the other processes with the same
priority run in round-robin fashion within the current epoch (we'll
get to epochs a little bit later).
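To see the effect of the granularity, here is a small simulation of several processes that share one dynamic priority. It is a sketch under simplifying assumptions: the function name and the 25 ms value are made up, and no process ever blocks.

```python
from collections import deque

TIMESLICE_GRANULARITY = 25  # milliseconds, illustrative value


def run_same_priority(timeslices):
    """Simulate processes at one priority level.

    timeslices maps a process name to its remaining timeslice in ms.
    Returns the sequence of (name, ms_run) scheduling decisions.
    """
    queue = deque(timeslices.items())
    trace = []
    while queue:
        name, remaining = queue.popleft()
        if queue:
            # Others are waiting at this priority: run one granule at most.
            ran = min(remaining, TIMESLICE_GRANULARITY)
        else:
            # Nobody else at this priority: run out the whole timeslice.
            ran = remaining
        trace.append((name, ran))
        remaining -= ran
        if remaining > 0:
            queue.append((name, remaining))  # go to the back of the line
    return trace


print(run_same_priority({"A": 60, "B": 30}))
# [('A', 25), ('B', 25), ('A', 25), ('B', 5), ('A', 10)]
```

Process A still receives its full 60 ms, but in pieces interleaved with B rather than in one uninterrupted run.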
Interactivity credits limit the volatility of dynamic priority
changes. A process that has been blocked for a long time has a high
interactivity credit, and one that has been running for a long time
without blocking has a low interactivity credit. A process with a
high interactivity credit will have less time subtracted from its
sleep_avg if it has a long CPU run; a process with a low
interactivity credit will have less time added to its
sleep_avg when it wakes up from a long sleep.
The Linux run queue consists of two priority arrays, the "active" array and the "expired" array. An "epoch" is defined by the time it takes to run every process in the active priority array.
During an epoch, the scheduler examines every priority level in the active array, and runs every process in the array until it has exhausted its timeslice or has blocked. If a process exhausts its timeslice, a new dynamic priority and timeslice are calculated for it, and it is placed in the expired array. When all the processes in the active array have been run (and the active array has become empty), the pointers to the active and the expired arrays are swapped.
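The two-array scheme can be sketched as follows. The RunQueue class, its method names, and the number of priority levels are assumptions made for illustration; the linear scan for the first non-empty level is a simplification, and blocking and priority recalculation are left out.

```python
from collections import deque

NUM_PRIOS = 140  # assumed number of priority levels


class RunQueue:
    """Hypothetical model of one CPU's run queue."""

    def __init__(self):
        self.active = [deque() for _ in range(NUM_PRIOS)]
        self.expired = [deque() for _ in range(NUM_PRIOS)]
        self.epoch = 0

    def enqueue(self, task, prio):
        self.active[prio].append(task)

    def expire(self, task, new_prio):
        # Called when a task has used up its timeslice.
        self.expired[new_prio].append(task)

    def pick_next(self):
        """Return (prio, task) for the best runnable task, or None."""
        if not any(self.active):
            if not any(self.expired):
                return None
            # The epoch is over: swap the two arrays.
            self.active, self.expired = self.expired, self.active
            self.epoch += 1
        for prio, queue in enumerate(self.active):
            if queue:
                return prio, queue.popleft()


rq = RunQueue()
rq.enqueue("editor", 115)
rq.enqueue("compiler", 125)
order = []
for _ in range(4):
    prio, task = rq.pick_next()
    order.append((rq.epoch, task))
    rq.expire(task, prio)  # pretend the task used its whole timeslice
print(order)
# [(0, 'editor'), (0, 'compiler'), (1, 'editor'), (1, 'compiler')]
```

Because the swap only exchanges two references, starting a new epoch costs the same no matter how many processes are waiting.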
The CPU load is just the number of runnable tasks on that CPU. This is maintained as a running average, so we don't over-react to sudden swings in the number of runnable processes.
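As a rough illustration of the smoothing, the sketch below averages the previous load with the current number of runnable tasks. The function name and the choice of equal weights are assumptions; the notes above say only that a running average is kept.

```python
def update_cpu_load(old_load, nr_running):
    """Blend the previous load with the current count of runnable tasks."""
    return (old_load + nr_running) / 2


load = 2.0
history = []
for nr_running in [10, 2, 2]:   # a brief spike, then back to normal
    load = update_cpu_load(load, nr_running)
    history.append(load)
print(history)   # [6.0, 4.0, 3.0]
```

The spike to 10 runnable tasks moves the load only to 6, and the value settles back toward 2 over the next few samples.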
Ordinarily, processes stay on the processor on which they were spawned. However, situations may arise in which one processor has many runnable processes, and another may have few or none. Load balancing is the attempt to correct this situation.
For purposes of load-balancing, a system is divided into scheduler domains. If the system is anything other than a NUMA multiprocessor, it has one domain; if it is a NUMA machine, then each node in the multiprocessor is a scheduler domain.
Each scheduler domain divides its CPUs into groups. On a uniprocessor or SMP system, each processor is a group; in an SMT system, the virtual processors in a single physical processor are in the same group.
In a NUMA system there are two levels of scheduler domains; in addition to what was described above, there is a top-level scheduler domain each of whose groups is a node.
A domain's load is balanced within that domain. Tasks are moved between groups in a domain when groups within the domain become unbalanced. The load of a group is the sum of the loads of its CPUs.
Every so often (determined by the "balance interval"), a CPU will attempt to rebalance its load. When that happens, it looks around and decides which group within its domain is busiest. If that is a group other than its own, it will move tasks from the busiest group's busiest runqueue to the current CPU's runqueue.
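A single rebalancing pass within one domain might look like the sketch below. All names are hypothetical. Two simplifications are assumed: a CPU's load is taken to be its instantaneous queue length rather than a running average, and the number of tasks moved is half the difference between the two runqueues.

```python
def rebalance(this_cpu, groups, runqueues):
    """Pull tasks toward this_cpu from the busiest group in its domain.

    groups is a list of tuples of CPU ids; runqueues maps a CPU id to its
    list of runnable tasks. Returns the list of tasks that were moved.
    """
    load = {cpu: len(queue) for cpu, queue in runqueues.items()}
    # A group's load is the sum of the loads of its CPUs.
    busiest = max(groups, key=lambda group: sum(load[c] for c in group))
    if this_cpu in busiest:
        return []  # our own group is the busiest: nothing to pull
    source = max(busiest, key=lambda cpu: load[cpu])
    # Assumption of this sketch: move half of the difference.
    count = (load[source] - load[this_cpu]) // 2
    moved = [runqueues[source].pop() for _ in range(count)]
    runqueues[this_cpu].extend(moved)
    return moved


groups = [(0, 1), (2, 3)]
runqueues = {0: ["a", "b", "c", "d"], 1: ["e"], 2: [], 3: ["f"]}
print(rebalance(2, groups, runqueues))   # ['d', 'c']
print(runqueues[0], runqueues[2])        # ['a', 'b'] ['d', 'c']
```

If the busiest runqueue is no more loaded than the current CPU's, count is zero or negative and nothing moves.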
Beyond this, there are a number of heuristics intended to keep the rebalancing logic from constantly moving tasks around.