

3. Benchmarks for SMP systems

3.1 Description of the problem

SMP (Symmetric MultiProcessing) has been implemented in the Linux kernel for Intel Pentium, Pentium MMX, Pentium Pro and Pentium II processors (4) and, more recently, for SPARC architectures. SMP systems are usually more expensive than their uniprocessor counterparts, since they are frequently used to build heavy-duty (possibly fault-tolerant) servers. For this reason, potential buyers of such systems often want to make sure that the applications, the OS and the hardware platform will satisfy their needs in terms of overall performance before committing to an expensive purchase. This is precisely where a Linux SMP benchmark would be useful. Since this series of articles focuses on the current, stable 2.0.x kernels, we will only deal with what can be done to benchmark Linux SMP systems with current Linux distributions.

Taking advantage of the additional computing power brought to the end-user by an SMP hardware platform puts constraints on almost all layers of the software involved: application, runtime libraries and operating system.

Basically two approaches are possible depending on how the application being considered is designed:

  1. The application uses multiple simultaneously running processes. Those processes are very likely to communicate with each other using standard IPC (Inter-Process Communication) mechanisms (a short shell sketch follows the comparison below).
  2. The application is multi-threaded: within each process concerned, multiple threads of sequential execution share the same address space.

The comparison below summarizes the impact of these two designs on the software layers involved, on the programming complexity and on the expected performance improvement (relative to a comparable uniprocessor system):

Multiple single-threaded processes:
  Runtime library requirements: none.
  Operating system requirements: load balancing, i.e. smart assignment of processes to processors (static or dynamic).
  Example: make -j 4 vmlinux
  Additional programming complexity: none.
  Expected performance improvement: average to poor.

Multi-threaded application:
  Runtime library requirements: libraries must be thread-safe and should preferably offer some POSIX control over the threads.
  Operating system requirements: a mechanism for assigning kernel threads to processors must be supported.
  Example: none available AFAIK.
  Additional programming complexity: greater than for single-threaded applications, but it can be done by us mere mortals.
  Expected performance improvement: high (close to linear speedup) for CPU-bound applications, but can degrade to single-processor performance for system-call-intensive applications.
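As a quick illustration of the first design (a minimal sketch only; the file names are hypothetical and this is not part of the benchmark): any job that can be split into independent processes already takes advantage of Linux SMP, because the kernel is free to run each process on a different processor.

  # Two independent gzip processes; on an SMP kernel each one may end
  # up on its own processor, with no special programming required.
  gzip -9 bigfile1 &
  gzip -9 bigfile2 &
  wait    # wait for both background jobs to finish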

How do those issues relate to current stable Linux kernels?

Good results obtained from a Linux multi-threaded benchmark would be very interesting for power users.

3.2 Runtime issues

Threads can be implemented at the user level as coroutines, or as kernel threads (i.e. threads running in user mode but scheduled by the kernel); the LinuxThreads package takes the latter approach, building each thread on the clone() system call. Until the very recent release of glibc 2.0, which Red Hat 5.0 includes as its standard C library, finding a thread-safe runtime library could be a tough job.

3.3 Scheduling issues

The issue here is the way scheduling is implemented on SMP platforms by the current stable kernels. Quoting its implementor Alan Cox (in a paper he wrote in 1995):

"A single lock is maintained across all processors. This lock is required to access the kernel space. Any processor may hold it and once it is held may also re-enter the kernel for interrupts and other services whenever it likes until the lock is relinquished. This lock ensures that a kernel mode process will not be pre-empted and ensures that blocking interrupts in kernel mode behaves correctly. This is guaranteed because only the processor holding the lock can be in kernel mode, only kernel mode processes can disable interrupts and only the processor holding the lock may handle an interrupt."

So a correct interpretation of this is: right now, no more than a single process may be executing in kernel mode (i.e. executing a system call) at any given time.

But efforts are underway to improve the granularity of locking in future 2.2.x kernels. We should also soon be able to take interrupts without having to take a lock. This should result in much better performance of system call intensive applications on SMP systems running GNU/Linux.
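A rough way to observe this on a 2.0.x SMP box (an informal experiment, not part of the benchmark; the commands, file names and block counts are only placeholders): run two CPU-bound jobs in parallel, then two system-call-intensive jobs in parallel, and compare how much total CPU time each pair accumulates relative to elapsed time.

  # Two CPU-bound processes: user+system time should approach twice the
  # elapsed time (close to 200% CPU) on a dual-processor machine.
  time sh -c 'gzip -9 -c bigfile > /dev/null & gzip -9 -c bigfile > /dev/null & wait'

  # Two system-call-intensive processes: both spend most of their time in
  # kernel mode, so they serialize on the kernel lock and the total stays
  # much closer to 100% CPU.
  time sh -c 'dd if=/dev/zero of=/dev/null bs=512 count=500000 &
              dd if=/dev/zero of=/dev/null bs=512 count=500000 & wait'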

3.4 Further reading/links

  1. "An Implementation Of Multiprocessor Linux", Alan Cox, 1995. I found this TeX article in the Linux source tree (kernel 2.0.33 source in /Documentation/smp.tex).
  2. A FAQ about the clone() Linux system call.
  3. A clone() utilization example
  4. LinuxThreads: a package that implements POSIX threads under Linux.

3.5 Benchmark availability

If we stick to our guideline for simple, quick running, readily available benchmarks (or more simply, K.I.S.S. benchmarks), we can use a modified version of the Linux kernel 2.0.0 compilation benchmark (described in article II), now for SMP systems. Andy Kahn provided us with this test and some very interesting results. Quoting directly from some email we exchanged on this subject:

"...actually, it's pretty simple. GNU "make" has an option you can specify to use multiple processes (either a default number or a user specified number).I don't have the man page handy right now, but i'm pretty sure it's either the -j option or the -p option (actually, i think both options have some importance to multiple processes). Once you specify multiple make processes, each process will have gcc compiling something (so in effect, it's just multiple gcc processes).

(later)

"Andre Derrick Balsa" wrote:

-> Great news :-)

->

-> Thanks to Andy who actually tried this on a dual PPro SMP system and

-> explained the whole thing to me, I am pleased to announce a version of

-> the Linux 2.0.0 kernel compilation application benchmark for SMP

-> systems:

->

-> Just replace the "make vmlinux" (was "make zImage") by "make -j n

-> vmlinux". Replace n by 2, 3 ... and make will launch 2, 3 ... processes

-> in parallel. Since Linux SMP will transparently distribute processes

-> between the SMP processors, there is no need to program anything special

-> in terms of message-passing, clone(), etc...

->

-> Andy doesn't have any exact figures available, but it seems this would

-> provide a 30% decrease in compilation time (over a single serialized

-> process). Thanks, Andy. :-)

->

and because I don't have any exact figures, I decided that I would go and get some exact figures. :)

The system tested was:

Dual Pentium Pro 180MHz overclocked to 200MHz, 64MB EDO RAM
Linux 2.0.27, gcc v2.7.2.1, libc v5.3.12
hda: QUANTUM TRB850A, 810MB w/96kB Cache, LBA, CHS=823/32/63

This is more or less your "standard" PC from about 13-14 months ago. I'm not at liberty to upgrade the software on this system, so this is as good as it gets from me with this setup.

Also, instead of doing a "sync" before issuing the final "make" command, I propose that if the circumstances allow it (you have root access), then umount the file system, remount it, then go back to that directory and build the kernel.

--- THE RESULTS! ---

"time make vmlinux" 107.32user 149.01system 4:27.91elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (143472major+167951minor)pagefaults 0swaps

"time make -j 2 vmlinux" 131.13user 177.77system 3:28.34elapsed 148%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (169498major+168582minor)pagefaults 8903swaps

Ugh, the results are terrible (only a 22% improvement)!! Note that in the SMP case, CPU usage was only 148%. From this, we can see that the 2nd CPU wasn't really used all that much (efficiently)."
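For reference, the 22% figure follows directly from the two elapsed times reported above (4:27.91 = 267.91 s and 3:28.34 = 208.34 s):

  (267.91 - 208.34) / 267.91 ≈ 0.22, i.e. a 22% reduction in elapsed time,
  or a speedup of 267.91 / 208.34 ≈ 1.29 instead of the ideal 2.0 on two
  processors.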

I really appreciated Andy's attitude: not only did he improve on my previous test procedure, he also went right ahead and produced some nice experimental data to go with it! And one can feel how enthusiastic he was about doing some hands-on experimentation!

Another nice feature of this simple SMP benchmark is that it provides a basis for performance comparisons between uniprocessor and SMP GNU/Linux systems.
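For readers who want to reproduce the test, here is a sketch of the complete procedure described above. It assumes root access, a kernel tree already configured as in article II, and that the tree lives on its own file system, mounted here on /usr/src purely as a placeholder (with a matching /etc/fstab entry); the intermediate "make clean" is only there to keep the two runs comparable.

  cd /                       # step outside the file system so it can be unmounted
  umount /usr/src            # unmounting and remounting flushes the buffer cache...
  mount /usr/src             # ...more thoroughly than a plain "sync"
  cd /usr/src/linux

  time make vmlinux          # reference run: a single make/gcc process

  make clean                 # return the tree to its initial state
  cd /; umount /usr/src; mount /usr/src; cd /usr/src/linux

  time make -j 2 vmlinux     # SMP run: n = 2 processes for a dual-processor box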

Two more benchmarks deserve a thorough description, but I will just mention them here:

  1. UnixBench 4.1 has some tests that will launch simultaneous processes.
  2. A rather complex, but complete Unix benchmark suite developed in France, called SSBA. François is working on a Linux port of the latest 2.4F revision.

