...making Linux just a little more fun!
By Vinayak Hegde
Two of the most critical parts of a kernel are the memory subsystem and the scheduler. This is because they influence the design and affect the performance of almost every other part of the kernel and the OS. That is also why you would want to get them absolutely right and optimize their performance. The Linux kernel is used right from small embedded devices, scaling up to large mainframes. Designing is scheduler is at best a black art. No matter how good the design is, some people will always feel that some categories of processes have got a raw deal.
In the article that follows, I have purposely tried to skip quoting any reference code because one can easily get it from the net (see the references). The article also looks the challenges developers found when they redesigned the scheduler, how those challenges were met, and what could be the the future direction the scheduler is likely to follow. Having said that, there's nothing like reading the source code to get an understanding of what's going on under the hood. You will find the implementation of the scheduler in kernel/sched.c if you have the kernel source installed.
The Linux scheduler strives to meet several objectives :
The scheduler should give a fair amount of CPU share to every process. Quite a fair amount of work has been done in the new kernel to ensure fair sharing of CPU time among processes.
The scheduler should try to maximize both throughput and CPU utilization. The usual method of increasing CPU utilization is by increasing the amount of multi-programming. But this is only beneficial up to a point after which it becomes counterproductive and thrashing starts.
A scheduler itself should run for as small time as possible. The scheduler latency should be minimal. But this is the tricky part. It is generally seen that scheduling itself is not useful work (??) . But if the scheduling is done right even after expending some more time, it may be worth the effort. But how do we decide which is the optimal point? Most scheduler solve this problem by using some heuristic policies.
Priority scheduling means that some processes get more preference over others. At the very least the scheduler must differentiate between I/O-bound processes and CPU-bound processes. Moreover, some kind of aging must be be implemented so that starvation of processes does not take place. Linux does enforce priorities as well as differentiates between different categories of processes. The Linux kernel differentiates between batch scheduled jobs and interactive. They get a share of the CPU according to their priorities. Probably some people have used the nice command to change the priority of a process.
Turnaround time is the sum of the service time and amount of time wasted waiting in the ready queue. The scheduler tries to reduce both.
The response from a program should be as fast as possible. Also another important factor which is often ignored is the variance between the response times. It is not acceptable if the average response time is low but some user occasionally experiences say, a 10-second delay from an interactive program.
The scheduler also tries to meet some other goals such as predictability. The behavior of the scheduler should be predictable for a given set of process with assigned priorities. The scheduler performance must degrade gracefully under heavy loads. This is particularly important because the Linux is very popular in the server market, and servers tend to get overloaded during peak traffic hours.
What's O(1) you may ask? Well, I am going to skim the surface on what the O (known as the Big-Oh) notation means. You will find references to the Big-Oh notation in any good algorithms book. What the Big Oh notation essentially does is that it tries to estimate the running time of any algorithm independent of the machine-related implementation issues. It places a upper bound on the running time of the algorithm - that is an upper bound on the algorithm's worst case. It is an exceptional technique to compare the efficiency of an algorithm with respect to running time.
Take for example an algorithm which has two nested loops and both of whose limits range from 1 to n-1 where (n is number of inputs for the algorithm) then the upper bound on it's running time is denoted by O(N2) . Similarly, consider an algorithm to search for an element in an unordered linked list. The worst case is that we will have to traverse the list till the last element to get a match or worse still - to find out that the element is not in the list. Such an algorithm is said to have O(N) complexity as it's running time is directly proportional to the number of elements in the list - N.
The Linux scheduler takes constant time to schedule the processes in the ready queue. Hence, it is said to have O(1) complexity. No matter what is the number of processes active on the Linux system the scheduler will always take constant time to schedule them. All the "parts" - wakeup , selection of the process to be run next, context switching and timer interrupt overhead - of the current Linux kernel (2.5.49 is the reference here - See resources for details) have O(1) complexity. Hence, the new scheduler has O(1) complexity in it's entirety.
As mentioned in the introduction, the Linux kernel runs on almost anything from wristwatches to supercomputers. With the earlier schedulers there were some problems with the scalability. Some of these problems were solved by modifying the kernel for the application and the target architecture. However, the core design of the scheduler was not very scalable. The new scheduler is much more scalable and SMP (Symmetric Multi-Processing) aware. The performance of the scheduler is much better for SMP systems now. One on the goals stated by Ingo Molnar who has written the O(1) scheduler is that in SMP, processors should not be idle when there is work to do. Also, care should be taken that processes do not get scheduled on different processors form time to time. This is to avoid the overhead of filling in the cache with the required data every time.
This is not exactly a new feature, but there are some patches that can be applied to support batch scheduling. Earlier kernels also had some support for batch scheduling. As of now, the batch scheduling of task is done using priority levels. The Linux kernel uses about 40 nice levels (though they are all mapped to about 5 levels of priority). Batch scheduled processes generally get the CPU when there are not many interactive and CPU-bound processes around, which have more priority. Batch scheduled processes get bigger time-slices to run than normal processes. This also minimizes the effect of swapping data in and out of the cache frequently, thus improving the performance of batch jobs.
One of the major improvements in the new scheduler is detection and boosting the performance of interactive tasks. Tweaking the old scheduler code was a bit cumbersome. In the new scheduler, detection of interactive jobs is decoupled from other scheduler tasks such as time-slice management. Interactive jobs in the system are detected in the system with the help of usage-related statistics. That means interactive tasks have good response times under heavy loads, and CPU-hogging processes cannot monopolize CPU time. The new scheduler actively detects interactive tasks and give them more precedence over other tasks. Even then, a interactive task is scheduled with other interactive tasks by using Round-Robin scheduling. This makes a lot of sense for desktop users as they will not see response time increase when they start a CPU intensive job such as encoding songs into the ogg format. Plans are on to merge O(1) and pre-emption code to give better response times for interactive tasks.
Because of the redesign of the new scheduler, it scales more easily to other architectures such as NUMA (Non-Uniform Memory Access) and SMT. NUMA architecture is used on some high-end servers and supercomputers. Also there is some ongoing work on the SMT (Symmetric Multi-Threading). SMT is also known by a more popular term: HyperThreading. One of the reasons is that now every CPU has it's own run queue. Only the load balancing sub-part has a "global" view of the system. So any changes for some architecture are to be made mainly in the load balancing sub-part. Also recently a patch for NUMA was released. In fact it has been incorporated in Linux 2.5.59. SMT processors implement two (or more) virtual CPUs on the same physical processor - one "logical" processor can be running while the other waits for memory access. SMT can certainly be seen as a sort of NUMA system, since the sibling processors share a cache and thus have faster access to memory that either one has accessed recently. Some work is going also going on SMT, but the new O(1) scheduler handles SMT pretty well even without any changes. Recently a patch was released for SMT. Though the NUMA architecture bears some resemblance to the SMT architecture, the Linux scheduler handles them differently.
The scheduler gives more priority to fork()ed children than to the parent itself. This could potentially be useful for servers which use the fork call to service requests. It could also be useful for GUI applications. There are also some real-time scheduling (based on priority) available in the kernel.
Vinayak is currently pursuing the APGDST course
at NCST. His areas of interest are networking, parallel
computing systems and programming languages. He
believes that Linux will do to the software industry
what the invention of printing press did to the world
of science and literature. In his non-existent free
time he likes listening to music and reading books. He
is currently working on Project LIberatioN-UX where he
makes affordable computing on Linux accessible for
academia/corporates by configuring remote boot stations