...making Linux just a little more fun!


Experiments with Kernel 2.6 on a Hyperthreaded Pentium 4

By Pramode C.E.

Having a dual CPU Linux system is cool, if you have deep enough pockets to buy one. You need a costlier motherboard, a more expensive power supply, and two CPUs, which obviously cost twice as much as one. That was the case until Intel brought out the Hyperthreaded (HT) Pentium 4. You get two `logical' CPUs within one package, and you don't pay the price of two (of course, you don't get the performance of a real dual CPU system either - if somebody said so, that would be marketing talk)! I recently purchased one such system and am still under the charm of `cat /proc/cpuinfo' telling me that I have two CPUs in my box. This article describes various experiments I conducted on my HT system running the shiny new Linux kernel 2.6.5, mostly with a view towards understanding concurrency control in Symmetric Multi Processing systems. Readers are expected to have an elementary understanding of Operating System concepts and Linux kernel programming to follow the material presented here.

What is Hyperthreading?

We visualize the microprocessor processing an instruction stream by fetching, decoding and executing instructions at a breakneck speed. There are many speed barriers on the way. Suppose the microprocessor encounters an instruction to do I/O - it will have to wait for data to arrive from a peripheral unit over a comparatively slow `bus'. Suppose during the course of executing an instruction, some data has to be obtained from main memory. If the data is not available in the cache closest to the execution engine, it will have to be obtained from a slower cache down the hierarchy, or worse, from main memory (main memory access times are very long compared to the CPU clock cycle, with the clock reaching about 3GHz in modern processors). A microprocessor with an `Out of Order' execution engine tries to identify instructions ahead in the stream which can be executed in parallel. But there are limits to this process - limits imposed by the inherent serial nature of the instruction stream. Many a time, the next instruction will depend on the results generated by the current instruction and so on.

What if the microprocessor can process two instruction streams in parallel? Doing this will require that it maintains two sets of `architectural state' (general purpose registers, instruction pointer etc), but the execution units (the ALU, FPU) need not be duplicated. It is said that this can be done with very little hardware (and thus, cost) overhead. Now we can visualize two instruction streams in memory - the microprocessor maintaining two instruction pointers and fetching instructions from both the streams. Because the execution units are not duplicated, it would be impossible for the microprocessor to really execute two instructions simultaneously - but what if an instruction from one stream takes a long time to complete (maybe it is doing I/O, or there was a cache miss and it's waiting for data to arrive from main memory)? During this period, the microprocessor can execute instructions from the other stream using the execution units which are sitting idle. This is the basis of Hyperthreading (or, what I understand to be the basis of Hyperthreading - the more inquisitive readers can form their own conclusions by reading some excellent documents, links to which I have put up in the concluding part of this article). The key to performance improvement here is that the two threads provide a mix of instructions which employ different execution units within the microprocessor. How much performance improvement will be seen in real application scenarios is still a topic of debate. Bottom line is - if you are purchasing an HT system for performance, look before you leap!

A multiprocessor-aware operating system like Linux sees the HT Pentium as it sees a `normal' twin CPU system, with logical processors 0 and 1. The scheduler can run tasks on any one of the CPUs.

My Setup

I am using an HT enabled 2.8GHz Pentium 4. The Linux distribution on this box is PCQLinux 2004, which is based on the latest offering from Fedora.

I boot the system with Linux kernel 2.6.5 anxiously looking for signs which proclaim that I am the owner of a twin CPU system. Here is part of the output from `dmesg':

Initializing CPU#0
Intel machine check reporting enabled on CPU#0.
CPU#0: Intel P4/Xeon Extended MCE MSRs (12) available
CPU#0: Thermal monitoring enabled
enabled ExtINT on CPU#0

Initializing CPU#1
masked ExtINT on CPU#1
Intel machine check reporting enabled on CPU#1.
CPU#1: Intel P4/Xeon Extended MCE MSRs (12) available
CPU#1: Thermal monitoring enabled

The next place to look is /proc/cpuinfo, and sure enough, it says that there are two CPUs. Interrupts are shared by both the processors - so /proc/interrupts is also interesting. It shows the number of interrupts of each category processed by each logical processor. Running the `top' command displays the CPU on which each process is running.

Playing with the scheduler

The 2.6 scheduler provides a system call to set task affinity. Normally, the scheduler is free to reschedule a process on any CPU. Using the sched_setaffinity system call, we can instruct it to restrict a process to one set of CPUs. The manual page of this system call provides the prototype as:

int sched_setaffinity(pid_t pid,unsigned int len, 
                      unsigned long *mask);

which is wrong! The prototype in the header file /usr/include/sched.h is:

int sched_setaffinity(__pid_t __pid, 
                      __const cpu_set_t *__mask);

This is what we should use. The first argument is the pid of the process whose scheduler affinity mask is to be altered (pid 0 means current process). The second argument is the mask. The mask should be manipulated only using the macros provided to do the same. Here is a function which when called with a logical CPU number as argument binds the calling process to that particular CPU (i.e., it will not be scheduled on any other CPU).

[Listing 1, using the sched_setaffinity call]

#define _GNU_SOURCE
#include <sched.h>

void run_on_cpu(int cpu)
{
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        sched_setaffinity(0, &mask);
}

CPU_ZERO initializes all the bits in the mask to zero. CPU_SET sets only the bit corresponding to cpu. It would be interesting to run lots of copies of this program, keeping top running on another console all the time. We will see all the processes getting scheduled on the same CPU. Another interesting experiment might be to run a command like yes:

yes > /dev/null &

Note down the CPU on which it is being scheduled and then run lots of copies of the program in [Listing 1] on the same CPU. We will see the scheduler migrating yes to the other CPU because it is relatively idle. If the affinity mask is not explicitly altered, a process will be rescheduled on any processor.

Measuring performance

Using toy programs for benchmarking is a stupid idea. Still, we wish to see whether there is some benefit to HT, at least in tailor-made situations. I wrote a program which had two functions - one reading from an I/O port in a loop and the other one incrementing a floating point number in a loop. Executed serially on one CPU, the program takes about 21 seconds (10 seconds each for the numeric and I/O functions). When the two tasks are executed on two different CPUs, the run time becomes about 12.4 seconds - a really big speedup. Is this behaviour expected or have I got something wrong? I am not really sure at this point. Readers with access to HT systems can surely help by commenting on the code and also trying to measure things on their systems.

[Listing 2, trying to measure HT performance]

Running both tasks with only do_io, or both with only do_float, reduced the measured speedup. When do_float was changed to do only integer arithmetic and both tasks (running on separate CPUs) were made to run it, there was a severe degradation in performance (compared to both tasks running on one CPU). I believe the one thing we can be sure of here is that the instruction mix really matters. As any useful program does lots of arithmetic, lots of I/O, as well as plenty of memory accesses, a combination of such programs will give the hyperthreaded CPU an opportunity to execute instructions from both threads in parallel using otherwise idle execution units.

Another experiment which I did was to run two jobs - one accessing memory in such a way as to generate lots of cache misses and the other one doing some floating point computations. About 25% speedup was observed when the tasks were made to run on two CPUs.

[Listing 3]

Playing with the kernel

The 2.6 kernel has brought in lots of scalability enhancements which makes Linux a platform of choice for heavy duty server applications. There have been radical changes in many kernel subsystems and the programming interface which the kernel presents to driver writers has changed at many places. Let's first try to build a kernel module using the new `kbuild' mechanism.

Compiling a simple module

The module does nothing - it can be loaded and unloaded, that's all.

[Listing 4, a simple kernel module]
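The listings are linked rather than printed; for reference, a do-nothing 2.6 module of the kind described needs only this much (a sketch of mine, not necessarily identical to Listing 4):

```c
#include <linux/module.h>

static int __init foo_init(void)
{
        return 0;
}

static void __exit foo_exit(void)
{
}

module_init(foo_init);
module_exit(foo_exit);

MODULE_LICENSE("GPL");
```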

We need a Makefile containing the following lines: (4.c is the name of our module)

obj-m := 4.o

default:
	make -C /usr/src/linux-2.6.5 SUBDIRS=`pwd` modules

Now, just invoke make and the kernel build system will automagically build the module. More details can be found in documents under Documentation/kbuild of the kernel source tree. Note that the object file to be loaded is named 4.ko.

Understanding race conditions

When two CPUs try to execute code which modifies shared objects in parallel, the outcome will depend on the order in which the accesses/modifications took place. For the remaining part of this discussion, we shall assume that if two CPUs try to access memory in parallel, some kind of hardware arbitration circuit will serialize the access - but which CPU gets access first is essentially unpredictable.

The SMP-aware Linux kernel allows multiple CPUs to execute kernel as well as user level code in parallel. Let's think of what would happen if code running on two CPUs tries to execute the following sequence simultaneously.

temp = count;
temp = temp + 1;
count = temp;

The variable count is a global object shared by both the executing tasks whereas temp is local to each task. One execution timeline can be:

Both CPUs try to do temp=count in parallel. Say CPU0 gets access to memory first. Code running on it will store the current value of count (say 0) in temp. Next, CPU1 gets access to memory. The task running on CPU1 also stores the current value of count (which is 0) in a local variable. Now, both CPUs increment their local copies. Both CPUs write back the value of temp (which is 1) to count. Note that even though both CPUs have executed code to increment count, because of execution overlaps, the count has become 1 and not 2. This is a really big problem. The problem would not have occurred had the code running on one CPU, say CPU1, read the value of count after the code running on the other CPU had finished incrementing it. But there is no such guarantee on a complex multiprocessor system.

Let's try to demonstrate this race condition experimentally. Even though we are not using a real dual CPU system here, the behaviour will be identical. We will write a simple character driver with a `device_write' method which keeps updating a global variable `count' in a loop. We will run two tasks, initially, both on the same CPU. The tasks will open a device file and invoke the `write' system call many times in a loop, with the effect that the code in the kernel gets activated. If each task invokes `write' 100 times, and each device_write updates the global `count' variable 100 times, then the total count should be, after both the tasks are over, 20000. Any other value is an indication of a race condition. We will experiment with both tasks running on the same CPU and on different CPUs.

[Listing 5, kernel module containing a race condition]

The test program forks a child process. Both the parent and child execute `write' on the same file descriptor, with the effect that `foo_write' gets invoked a total of 200000000 times. Each `foo_write' increments the count by 1 - if there is no race condition, the count should be 200000000. We do have a race here, so the count should be less than 200000000, which is what we observe when the parent and the child run on different CPUs. But we have a surprising result - even when the parent and child run on the same CPU, we get values less than 200000000. Why?

[Listing 6, testing the kernel module]

The answer is that I have compiled the 2.6 kernel with kernel preemption enabled. Classical uniprocessor Unix kernels are non-preemptive. That is, if a task enters the kernel, until it comes out, no other task would be scheduled. This design reduces problems associated with concurrency a lot. Kernel code which modifies a global data structure (say a linked list) need not be worried that it will be preempted during the execution of a critical section and some other task would enter kernel mode and access/modify the same data structure resulting in race conditions. The trouble with this design is that tasks which execute for a long time in the kernel would reduce the responsiveness of the system by not letting interactive tasks preempt them. The 2.6 kernel can be compiled with preemption enabled. The result is that even if both tasks are running on the same CPU, one task may be preempted after it has read and stored the value of count in the temporary variable and before it writes back the incremented value. This will result in race conditions cropping up.

The solution is to disable preemption by calling preempt_disable and enable it once again after the critical section is over by calling preempt_enable. These functions basically manipulate a preemption count, which can be obtained by calling preempt_count.

Doing an `atomic' increment

What if we change the three line sequence which increments count to just one line?

count++;

We don't have any guarantees that race conditions will not occur in this case either. The compiler may translate count++ into three separate instructions which fetch the value from memory into a register, increment the register, and store it back to the memory location. The GNU C Compiler version 3.3.2 on my system in fact translates count++ (count is declared as volatile) into three instructions. If we wish to make sure that count is incremented using just one assembly instruction, we will have to make use of inline assembly code like this:

__asm__ __volatile__ (
   "incl %0\n\t"
   : "+m" (count)
);

Modifying the program this way ensures that if both tasks are running on one CPU, even if preemption is not disabled, we get the correct results. This is because the current task won't be preempted until the assembly instruction it is executing is over. What if we make the tasks run on different CPUs? We note that again we are getting random values for count. This is because the `incl' instruction operates internally by bringing the value in memory to some place within the CPU, incrementing it and then transferring it back to the same location. If two CPUs try to execute this sequence in parallel, both CPUs might load the value from memory into some internal location (the memory system hardware ensures that only one CPU is able to read from a location at any one instant; but once that read is over, there is no guarantee that the other CPU will be kept from reading the location until the whole read-modify-write sequence on the first CPU is complete), increment it, and write it back, with the result that count gets incremented by one and not two. This seems to be the case even with Hyperthreading systems - the hardware only guarantees that the load-store operations of one thread happen in sequence. They may get intermixed with the load-store sequences generated by the other thread.

What is the solution to this problem? On x86 systems, the instruction prefix `lock' can be used to make certain assembly instructions atomic. The Intel manuals say that `lock' can be used with only a few instructions (inc is one of them). The hardware guarantees that two instructions prefixed by a `lock' running on two different CPUs and trying to access the same memory location will execute mutually exclusive to each other - that is, only after one instruction runs to completion (the full read-modify-write sequence) will the other instruction start executing.

__asm__ __volatile__ (
   "lock; incl %0\n\t"
   : "+m" (count)
);

The modified program runs slower, but it gives us the correct count of 200000000.

The atomic swap operation

The x86 XCHG instruction swaps a 32 bit register with a memory location. Even if it is not prefixed with a lock, it is guaranteed that XCHG will be atomic. It can be used as a building block for many synchronization primitives like the semaphore. Let's try to implement an atomic increment operation using the atomic swap.

[Listing 7, implementing atomic increment]

Using the builtin atomic functions

The kernel header include/asm/atomic.h defines a few inline functions and macros which can be used for performing atomic operations.

atomic_t count = ATOMIC_INIT(0);
atomic_set(&count, 2);
atomic_inc(&count);             /* count is now 3 */
atomic_read(&count);            /* returns 3 */

The spin lock

Let's now try to understand how the spin lock, a very fundamental multiprocessor synchronization mechanism, works. The problem is to prevent kernel code running on two CPUs from accessing a shared data structure concurrently. Had the data structure been something as simple as an integer (or some other basic type) and the operation been as simple as performing an addition or a subtraction or an increment, the atomic operations would have been sufficient. But what we wish to do is provide mutual exclusivity to two threads running on two different CPUs accessing an arbitrarily complex data structure using an arbitrary sequence of instructions. We need something more general than the atomic add, sub, or inc operations.

A flawed solution

Let the tasks running on both the CPUs look at a shared variable (let's call it `lock') before executing the section of code which has got to be mutually exclusive. If the value of lock is zero (lock is in the unlocked state), make it one and enter into the critical section. If not, keep on spinning until the lock becomes zero again.

[Listing 8]

void acquire_lock(volatile int *lockptr)
{
        while (*lockptr != 0)
                ;               /* spin till the lock looks free */
        *lockptr = 1;
}

void release_lock(volatile int *lockptr)
{
        *lockptr = 0;
}

Let's say code running on two CPUs is trying to increment a variable `count' by first storing it in `tmp'. We try to prevent race conditions from occurring by calling acquire_lock before the code sequence begins and release_lock after the code sequence ends.


acquire_lock(&lock);
tmp = count;
tmp = tmp + 1;
count = tmp;
release_lock(&lock);


Testing the code shows that it is not working as expected. The reason is simple. Suppose code executing on both the CPUs call acquire_lock simultaneously. A possible sequence of events would be:

  1. Code running on CPU0 reads *lockptr and sees that it is zero
  2. Before code running on CPU0 makes *lockptr equal to one, code running on CPU1 sees that *lockptr is zero.
  3. Both code segments DO NOT enter into the while loop
  4. Code on CPU0 makes *lockptr equal to 1
  5. Code on CPU0 happily enters the `critical section'
  6. Code on CPU1 also makes *lockptr equal to 1 and enters the critical section

The problem arose because the test of whether *lockptr is zero and the act of setting it to one were not executed atomically.

Another Attempt

The kernel header file include/asm/bitops.h presents an atomic test_and_set_bit function which takes a bit number and an address, sets that bit of the location pointed to by the address, and returns true if the bit was initially set and false if it was clear. Let's use this function to implement another version of the spin lock.

[Listing 9]

/* Spin locks, another attempt */

void acquire_lock(volatile unsigned long *lockptr)
{
        while (test_and_set_bit(0, lockptr))
                ;
}

void release_lock(volatile unsigned long *lockptr)
{
        clear_bit(0, lockptr);
}

Let's try to understand the logic. Assume that all the bits of the lock are initially clear (we use only one bit in this implementation). What if two CPUs try to execute test_and_set_bit concurrently? The implementation guarantees that one CPU will complete the operation before the other one begins it. That is, the value returned on one CPU is definitely going to be false, and on the other one, definitely true. The CPU which gets back 0 has succeeded in acquiring the lock while the other CPU keeps spinning. The spin loop exits only when the CPU which acquired the lock releases it by clearing the lock bit. This code should work properly. It does, on my system (though the fact that you keep getting the correct answer in tens of thousands of test runs is no guarantee that your implementation is correct - concurrent programming is too tough for that kind of simplistic assumption). The only trouble is that it takes a very long time to get the output when it is used with our kernel module which keeps incrementing count. But that is to be expected. Locking does degrade performance.

The Linux Kernel Spinlock

It's easy to use the spin locks provided by the Linux kernel. The kernel defines a new type spinlock_t. A spin lock is a variable of this type. The lock is initialized by a macro SPIN_LOCK_UNLOCKED.

spinlock_t lock = SPIN_LOCK_UNLOCKED;
spin_lock(&lock); /* Tries to acquire the lock */
spin_unlock(&lock); /* Release the lock */

The x86 kernel implementation contains some hacks to improve locking performance - readers can browse through include/linux/spinlock.h and include/asm/spinlock.h.

If a thread executing in kernel space acquires a spin lock and if it goes on to execute an interrupt service routine (on the same CPU), the ISR itself trying to acquire the lock would create problems. The ISR would keep on spinning, which will result in the lock holder not being able to release the lock. If code running on some other CPU is also spinning for the lock to be released, nobody would be able to make progress. So there are IRQ-safe versions of the lock and unlock function - they are called spin_lock_irqsave and spin_unlock_irqrestore. The first one saves the current state of local interrupts, disables local interrupts and tries to acquire the lock. The second one releases the lock and restores state of local interrupts.
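In code, the IRQ-safe variants are used like this (a usage sketch; flags is a local unsigned long that receives the saved interrupt state):

```c
spinlock_t lock = SPIN_LOCK_UNLOCKED;
unsigned long flags;

spin_lock_irqsave(&lock, flags);    /* local IRQs off, lock held */
/* ... critical section shared with the interrupt handler ... */
spin_unlock_irqrestore(&lock, flags);
```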

On uniprocessor kernels, the spin_lock and spin_unlock functions are compiled away - they expand to empty statements. But on kernels with preemption, they disable and enable preemption.


If not done properly, locking can create subtle problems which are difficult to debug. If a task running in kernel mode wants to access/modify two data structures, both protected by a lock (say to copy from one to the other), then the order in which the locks are acquired becomes significant. Say one task does:

lock A
lock B

---critical section---

unlock B
unlock A

and another task does:

lock B
lock A

---critical section---

unlock A
unlock B

There is a danger of deadlock if both CPUs complete the first set of locking operations in parallel (say task running on CPU0 acquires lock A and task running on CPU1 acquires lock B). Now, task on CPU0 will try to acquire lock B and will go in a loop and task on CPU1 will try to acquire lock A and will get into a spin. Both loops will never terminate and the system will freeze. Be careful when you try out the code I present here - you might have to press the reset switch! The moral is that all tasks should acquire locks only in one particular order.

[Listing 10, kernel module demonstrating deadlock]

Here is a user space program which triggers the deadlock:

[Listing 11]

Simply running the user space program a few times from the command prompt did not result in any problem on my machine. The reason might be that code running on one CPU started only after code running on the other had locked both lock_a and lock_b. Running the program within a simple shell loop soon made the system freeze.

while true
do
	./11 # Name of the test program
done

This is one of the problems with code containing deadlocks - sometimes they do not manifest themselves, even if we test thoroughly.

Sequence locks

Sequence locks can be used in situations where a writer should get immediate access to a critical section even in the presence of multiple readers. Readers should realize that they have got inconsistent results because of a writer interfering, and they should be prepared to try reading again and again till they get consistent results. A writer should never interfere while another writer is active in the critical section. The code which implements this locking scheme is fairly simple and has been introduced as part of the 2.6 kernel.

Let's first look at a `toy' program which demonstrates the use of sequence locks. We have a writer process which keeps incrementing a 32 bit counter implemented as two independent 16 bit counters. The idea is to increment the higher 16 bits when the lower 16 bits overflow from 0xffff to 0x0. A reader process keeps reading the two 16 bit counters and generates a 32 bit count value. At no point of time should the reader process get a count which is lower than the previous count. The maximum count in the two 16 bit counters taken together at any point of time will always be less than 0xffffffff.

Two problems can give inconsistent results. Say the current value of the count is 0x6FFFF (stored as 0xFFFF and 0x6 in the two independent counters). The writer changes 0xFFFF to 0x0, which the reader reads and records as the lower 16 bits of the count. Next, if the reader reads the higher 16 bits before the writer increments them, the reader will get 0x6. So the combined value will be 0x60000, which is less than the previous count 0x6FFFF. The problem here is that the reader is interrupting the writer.

Another scenario. Say current total is 0x2FFFF. The reader reads the higher 16 bits and gets 0x2. Before the reader can sum this up with the lower 16 bits, the writer starts running and changes 0x2FFFF to 0x30000, 0x30001 etc. It is possible that the reader will run again only when the count becomes say 0x30006. At this point, the reader will get the lower 16 bits, 0x0006. Reader combines both parts and gets 0x20006 which is less than 0x2FFFF.

[Listing 12, kernel module demonstrating reader/writer problem]

Here is a user space test program:

[Listing 13]

In situations where there are only a few writers and many readers running concurrently, where a writer is so impatient that he is not prepared to wait for any of the readers, sequence locks can be employed effectively. Examples of usage can be found in kernel/timer.c.

We will modify our program to use seq locks - we shall then look at how these locks are implemented.

[Listing 14, Using sequence locks to solve the reader-writer problem]

Now let's look at the kernel implementation of the sequence lock.

[Listing 15]

typedef struct {
        unsigned sequence;
        spinlock_t lock;
} seqlock_t;

static inline void 
write_seqlock(seqlock_t *sl)
{
        spin_lock(&sl->lock);
        ++sl->sequence;
        smp_wmb();
}

static inline void 
write_sequnlock(seqlock_t *sl)
{
        smp_wmb();
        sl->sequence++;
        spin_unlock(&sl->lock);
}

static inline unsigned 
read_seqbegin(const seqlock_t *sl)
{
        unsigned ret = sl->sequence;
        smp_rmb();
        return ret;
}

static inline int 
read_seqretry(const seqlock_t *sl, unsigned iv)
{
        smp_rmb();
        return (iv & 1) | (sl->sequence ^ iv);
}
We note that seqlock_t is a structure which contains a sequence number and a spin lock - the initial sequence number will be 0 and the spin lock will be in the unlocked state. A writer will always have to acquire this spinlock to enter the critical section - this guards against other writers. Once the writer enters the critical section, it increments the sequence number. Once the writer leaves the critical section, the sequence number gets incremented again and the lock is unlocked. The important point here is that while the writer is operating, the sequence number will be odd. No writer in the critical section means the sequence number will be even. A reader enters the critical section without grabbing any lock. If it is lucky, it won't be interrupting any writer and no writer will interrupt it. How does the reader know that it has been lucky enough? Simple: before doing its critical section, the reader saves away the sequence number of the associated lock. After the critical section is over, the reader checks for two things:

  1. Whether the saved sequence number was odd. If so, it means that the reader was running while a writer also was there in the critical section.
  2. Whether the saved sequence number and the current sequence number are the same. If not, it means that a writer had interrupted the reading process.

If there have been no interruptions, the reader is happy with the value it has got - it's surely going to be a consistent value. Otherwise, the reader will redo the reading process till it gets a consistent result. The read_seqretry function is the heart of this technique. Its body looks like this:

return (iv & 1) | (sl->sequence ^ iv);

The first part checks whether iv, the old sequence number, is odd. The second part checks whether the old and the new sequence numbers are identical. The smp_rmb and smp_wmb functions which you see in the code are there to handle out of order loads and stores - which I don't think Intel CPUs do - so those functions need not hinder your understanding of the code.


Conclusion

I have tried to present a few things about hyperthreading and concurrency control in this article. Concurrency is tricky business - and I am not an expert. I shall be grateful to readers who point out errors/inconsistencies, if any.

Further Reading

Hyper threading technology is explained in depth in this Intel Document. An excellent Ars technica document looks at things in a much more graphic and intuitive way. The PC Processor Microarchitecture Technology Primer is an excellent read for those who are interested in getting some idea of how stuff like superscalar processing, out of order execution, branch prediction, etc. works without resorting to heavy-duty textbooks.

Readers who are new to inline assembly might refer to this IBM Developerworks article for enlightenment. It's impossible to read kernel code without a rudimentary understanding of this powerful feature of the GNU C Compiler.

There is an interesting discussion archived here which deals with the necessity of the `lock' prefix on hyperthreaded systems.

The book Unix Systems for Modern Architectures by Curt Schimmel is a must read for programmers working on modern SMP capable Unix kernels. The implementation of atomic increment using atomic swap has been lifted from solution to problem 8.9 of this book. An excellent book for the kernel newbie is Linux Kernel Development by Robert Love of the preemptive kernel fame.


[BIO] I am an instructor working for IC Software in Kerala, India. I would have loved becoming an organic chemist, but I do the second best thing possible, which is play with Linux and teach programming!

Copyright © 2004, Pramode C.E.. Released under the Open Publication license unless otherwise noted in the body of the article. Linux Gazette is not produced, sponsored, or endorsed by its prior host, SSC, Inc.

Published in Issue 103 of Linux Gazette, June 2004
