The Linux 2.6 Scheduler

Note: this information has all been distilled from Josh Aas's excellent, in-depth discussion of the scheduler. His paper is available at http://josh.trancesoftware.com/linux/; for the full details on how things really work, see that paper.

Goals

The Linux 2.6 scheduler is intended for use in an extremely broad range of application domains with fundamentally incompatible goals:

Desktop, Server, and HPC Environments

The first source of conflicting goals is the range of application domains in which Linux systems are used.

Linux was first developed as a desktop operating system (for Linus Torvalds's desktop!). However, its first successful market penetrations occurred in the server and high-performance computing markets: in effect, the traditional Unix markets. These three application domains have very different requirements for response time and throughput.

Uniprocessor, SMT, SMP, and NUMA Environments

The second challenge comes from the range of hardware environments in which the scheduler is deployed: uniprocessor, SMT, SMP, and NUMA systems.

The Scheduler

Having said that, let's move on to the scheduler implementation. We're only discussing SCHED_OTHER processes here, not "real-time" scheduling.

Static and Dynamic Priority

A Linux process has both a "static" and a "dynamic" priority. The static priority is the nice value as defined by a user; the lower the nice value, the higher the priority. A normal user can set a nice value between 0 and 19; the superuser can set niceness as low as -20. The default niceness is 0.

The dynamic priority is calculated by modifying the static priority according to the process's behavior; a CPU hog will have a penalty applied, while an I/O-bound process will have a boost applied.

An I/O-bound process will spend much of its time sleeping, waiting for resources other than the CPU. When a process is awakened from a sleep, its total sleep time is added to a per-process variable called sleep_avg. When a process gives up the CPU, either voluntarily because it is waiting for a resource or involuntarily because it got preempted, the time it spent running is subtracted from sleep_avg. The maximum sleep_avg is bounded by a MAX_SLEEP_AVG parameter.
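
The bookkeeping just described can be sketched in a few lines. This is only an illustration, not the kernel's code: the function names, the millisecond units, the value chosen for MAX_SLEEP_AVG, and the floor at zero are all assumptions made for the example.

```python
MAX_SLEEP_AVG = 1000  # hypothetical bound, in milliseconds


def on_wakeup(sleep_avg, slept_ms):
    """Credit the time just spent sleeping, capped at MAX_SLEEP_AVG."""
    return min(sleep_avg + slept_ms, MAX_SLEEP_AVG)


def on_descheduled(sleep_avg, ran_ms):
    """Debit the time just spent running (assumed never to go below zero)."""
    return max(sleep_avg - ran_ms, 0)


sleep_avg = 0
sleep_avg = on_wakeup(sleep_avg, 700)       # 700
sleep_avg = on_wakeup(sleep_avg, 400)       # 1000, capped by MAX_SLEEP_AVG
sleep_avg = on_descheduled(sleep_avg, 250)  # 750
```

A process that mostly sleeps drifts toward MAX_SLEEP_AVG, and one that mostly runs drifts toward zero.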

The process's sleep_avg is mapped into a bonus value between -5 and 5 using a simple linear formula.
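
One plausible form of that linear mapping is sketched below. It is an assumption-laden illustration: MAX_BONUS = 10 is inferred from the -5 to 5 range, the MAX_SLEEP_AVG value is made up, and expressing the dynamic priority on the nice scale (clamped to the legal nice range) is a simplification chosen to match the description above.

```python
MAX_SLEEP_AVG = 1000  # hypothetical bound, in milliseconds
MAX_BONUS = 10        # width of the bonus range (-5 .. 5)


def bonus(sleep_avg):
    """Map sleep_avg in [0, MAX_SLEEP_AVG] linearly onto [-5, 5]."""
    return sleep_avg * MAX_BONUS // MAX_SLEEP_AVG - MAX_BONUS // 2


def dynamic_priority(static_priority, sleep_avg):
    """Lower number means higher priority, so the bonus is subtracted."""
    prio = static_priority - bonus(sleep_avg)
    return max(-20, min(19, prio))  # assumed clamp to the nice range


print(bonus(0), bonus(500), bonus(1000))  # -5 0 5
print(dynamic_priority(0, 1000))          # -5 (heavy sleeper is boosted)
print(dynamic_priority(0, 0))             # 5  (CPU hog is penalized)
```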

The Timeslice

The scheduler dynamically adjusts both a process's dynamic priority and the length of its timeslice. An interactive process is (contrary to my intuition!) given a longer timeslice than a compute-bound process; timeslice is calculated by mapping the dynamic priority to a timeslice value.
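
To make the idea of mapping a priority to a timeslice concrete, here is a minimal sketch. These notes do not give the formula, so the linear interpolation, the names MIN_TIMESLICE_MS and MAX_TIMESLICE_MS, and their values are all assumptions made for illustration; priorities are on the nice scale.

```python
MIN_TIMESLICE_MS = 10   # hypothetical shortest timeslice
MAX_TIMESLICE_MS = 200  # hypothetical longest timeslice
BEST_PRIO, WORST_PRIO = -20, 19


def timeslice_ms(priority):
    """Interpolate linearly: best priority gets the longest timeslice."""
    span = MAX_TIMESLICE_MS - MIN_TIMESLICE_MS
    steps = WORST_PRIO - BEST_PRIO
    return MIN_TIMESLICE_MS + span * (WORST_PRIO - priority) // steps


print(timeslice_ms(-20))  # 200
print(timeslice_ms(0))    # 102
print(timeslice_ms(19))   # 10
```

Under this mapping, a process whose priority has been boosted for interactivity ends up with a longer timeslice than a penalized CPU hog with the same nice value.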

Note that "timeslice" as used by the scheduler doesn't quite mean the same thing as "timeslice" in our textbook. In particular, there is a parameter called TIMESLICE_GRANULARITY. If a process has been running for that long and there are other processes with the same dynamic priority, it is preempted, and the processes at that priority run in round-robin fashion within the current epoch (we'll get to epochs a little bit later).

Interactivity Credits

Interactivity credits limit the volatility of dynamic priority changes. A process that has been blocked for a long time has a high interactivity credit, and one that has been running for a long time without blocking has a low interactivity credit. A process with a high interactivity credit will have less time subtracted from its sleep_avg if it has a long CPU run; a process with a low interactivity credit will have less time added to its sleep_avg when it wakes up from a long sleep.

Run Queue

The Linux run queue consists of two priority arrays, the "active" array and the "expired" array. An "epoch" is defined by the time it takes to run every process in the active priority array.

During an epoch, the scheduler works through the priority levels of the active array, running each process until it has either exhausted its timeslice or blocked. When a process exhausts its timeslice, a new dynamic priority and timeslice are calculated for it, and it is placed in the expired array. When all the processes in the active array have been run (and the active array has become empty), the pointers to the active and the expired arrays are swapped.
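
The two-array arrangement can be sketched as follows. This is a toy model, not the kernel's data structure: the class and method names are invented, the number of priority levels is arbitrary, and the linear scan for the first non-empty level is a simplification (the real scheduler uses a bitmap so it can find that level in constant time).

```python
from collections import deque

NUM_PRIORITIES = 40  # hypothetical number of priority levels


class RunQueue:
    def __init__(self):
        self.active = [deque() for _ in range(NUM_PRIORITIES)]
        self.expired = [deque() for _ in range(NUM_PRIORITIES)]

    def enqueue(self, task, prio):
        """Make a task runnable in the current epoch."""
        self.active[prio].append(task)

    def expire(self, task, new_prio):
        """Task used up its timeslice: park it until the next epoch."""
        self.expired[new_prio].append(task)

    def pick_next(self):
        """Return (task, prio) for the best runnable task, or None."""
        for _ in range(2):
            for prio, queue in enumerate(self.active):
                if queue:
                    return queue.popleft(), prio
            # Active array is empty: the epoch is over, so swap arrays.
            self.active, self.expired = self.expired, self.active
        return None


rq = RunQueue()
rq.enqueue("a", 5)
rq.enqueue("b", 3)
first = rq.pick_next()   # ("b", 3): lower index is higher priority
rq.expire("b", 4)
second = rq.pick_next()  # ("a", 5)
rq.expire("a", 5)
third = rq.pick_next()   # ("b", 4): arrays were swapped, new epoch
```

Swapping two references is what makes starting a new epoch cheap: no task has to be moved or re-sorted at that point.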

Load Balancing

A CPU's load is just the number of runnable tasks on that CPU. This is maintained as a running average, so we don't over-react to sudden swings in the number of runnable processes.
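
These notes don't say how the running average is computed, so the sketch below simply assumes exponential smoothing; the function name and the weight of 0.5 are hypothetical.

```python
def update_load(avg_load, nr_running, weight=0.5):
    """Blend the previous average with the current runnable-task count."""
    return (1 - weight) * avg_load + weight * nr_running


load = 0.0
load = update_load(load, 8)  # 4.0: a burst of 8 tasks counts only half
load = update_load(load, 8)  # 6.0: the average catches up if it persists
load = update_load(load, 0)  # 3.0: and decays just as gradually
```

A momentary spike therefore moves the load figure only part of the way, which is the damping effect described above.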

Ordinarily, processes stay on the processor on which they were spawned. However, situations may arise in which one processor has many runnable processes while another has few or none. Load balancing is the attempt to correct this situation.

Scheduler Domains

For purposes of load-balancing, a system is divided into scheduler domains. If the system is anything other than a NUMA multiprocessor, it has one domain; if it is a NUMA machine, then each node in the multiprocessor is a scheduler domain.

Each scheduler domain divides its CPUs into groups. On a uniprocessor or SMP system, each processor is a group; in an SMT system, the virtual processors in a single physical processor are in the same group.

In a NUMA system there are two levels of scheduler domains; in addition to what was described above, there is a top-level scheduler domain each of whose groups is a node.

A domain's load is balanced within that domain. Tasks are moved between groups in a domain when groups within the domain become unbalanced. The load of a group is the sum of the loads of its CPUs.

Rebalance Logic

Every so often (determined by the "balance interval"), a CPU will attempt to rebalance its load. When that happens, it looks around its domain and decides which group is busiest. If that is a group other than its own, it will move tasks from the busiest group's busiest runqueue to the current CPU's runqueue.
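
A rough sketch of one rebalance pass follows. It models a domain as a list of groups, a group as a list of runqueues, and a runqueue as a list of task names. The function names are invented, the instantaneous queue length stands in for the running-average load, and moving half the difference between the two runqueues is an assumption; these notes don't say how many tasks get moved.

```python
def group_load(group):
    """Load of a group: the sum of the loads of its CPUs."""
    return sum(len(runqueue) for runqueue in group)


def rebalance(domain, this_group, this_rq):
    """Pull tasks toward this_rq; return how many were moved."""
    busiest_group = max(domain, key=group_load)
    if busiest_group is this_group:
        return 0  # we are already in the busiest group: nothing to pull
    busiest_rq = max(busiest_group, key=len)
    # Assumed policy: move half the difference between the two runqueues.
    to_move = max((len(busiest_rq) - len(this_rq)) // 2, 0)
    for _ in range(to_move):
        this_rq.append(busiest_rq.pop())
    return to_move


cpu0, cpu1 = [], ["t1", "t2", "t3", "t4"]
group0, group1 = [cpu0], [cpu1]
domain = [group0, group1]
moved = rebalance(domain, group0, cpu0)  # 2: each CPU now has two tasks
```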

Beyond this, there are a number of heuristics intended to keep the rebalancing logic from constantly moving tasks around.


Last modified: Wed Nov 2 11:25:55 MST 2005