The original message in this thread appeared in Issue 30, Linux Memory Usage vs. Leakage
From Thomas L. Gossard on Fri, 03 Jul 1998
Regarding the recent question you received on memory leakage under 2.0.29. I don't believe it is a memory leak in the normal sense, where a program quits and won't give the memory back to the OS.
Once a program has quit (exited) it is the OS' responsibility to reclaim all RAM and normalize all other resources (process table entries, file descriptors and handles, etc.) that were allocated to that process.
If it fails to do so, that is a bug in the OS (the kernel and/or its drivers or core user space processes, like 'init'). Under Linux (and Unix in general) it is very rare to see this sort of bug. (I've never heard of any kernel memory leaks in Linux).
Under NT there is apparently a problem because the system is very complex and so much of the programming doesn't respect the intended modularity between "kernel" and "user space" --- so DLLs and drivers (particularly video drivers) will end up locked into memory with no references. Since I'm not an NT programmer (and not a systems programmer of any sort) I'll have to accept the considered opinions of others who've said that this is why NT has a notoriously poor stability record compared to any form of Unix. The fact that they've added some process memory protection and imposed some modularity and process isolation means that NT's stability is orders of magnitude better than MS-DOS, Windows 3.x, and Windows '95 ever were. However, it's reported to be very poor compared to any of the multi-user OS' like Unix or VMS.
I also use .29 and saw the same problem. I sent out several e-mails and found out that what is really happening is the OS has the memory but is not reporting it as free but has saved it for cache purposes. Notice the guy with the question said "ls" the first time took memory but not the second time. A memory leak will take the memory each time. The OS is keeping the memory for itself. The real problem is in the way the OS or top or whatever is reporting the memory usage and the way we expect to see it.
The way that memory is used by the Linux cache is fairly complex. Consequently the output from 'top' and 'free' and 'vmstat' are not easy to interpret (and I don't consider myself to be an expert in them by any means).
The intended design is supposed to use all "available" free memory for disk caching (and I guess the 2.2 kernels will implement disk and directory entry caching --- which should yield much better performance for several reasons). It is certainly possible that there were bugs in the caching and memory management code in some of the 2.0.x kernels. You could certainly go to the Linux kernel mailing list archives and read through the various change summaries to see. Or you could upgrade to a newer kernel and look for symptoms.
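To make the "free versus cached" distinction concrete, here is a rough Python sketch that adds the buffer and cache figures back onto the free figure. It assumes /proc/meminfo-style "Name: value kB" lines with the field names MemFree, Buffers and Cached; the exact layout varies between kernel versions, and the function names and sample numbers below are made up for illustration.

```python
def parse_meminfo(text):
    """Return a dict of field name -> size in kB from "Name: value kB" lines."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def effectively_free_kb(fields):
    """Free memory plus memory the kernel is only holding as buffers/cache."""
    return (fields.get("MemFree", 0)
            + fields.get("Buffers", 0)
            + fields.get("Cached", 0))


# Invented numbers for a 64 MB machine; not taken from any real system.
SAMPLE = """\
MemTotal:        64000 kB
MemFree:          2000 kB
Buffers:          6000 kB
Cached:          30000 kB
SwapTotal:      128000 kB
SwapFree:       127000 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the invented sample off Linux
    info = parse_meminfo(text)
    print("reported free:     ", info.get("MemFree", 0), "kB")
    print("free + buffers/cache:", effectively_free_kb(info), "kB")
```

On the invented sample, only 2000 kB is reported as "free", yet most of the rest is cache that the kernel can hand back on demand. That is the gap between what 'top' appears to say and what is actually available.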
The only true way to check on the problem seems to be to execute
some memory hog routines, like graphics and watch the swap usage.
In particular my mail program seemed to suck up 8 or 9 megs at a
time yet even going in and out of that and xv my swap was barely
touched. With a sufficient memory leak after a period of time the
swap should see a great deal of activity due to the lack of memory.
Most memory leaks are in user space --- in long running daemons like 'named', a web server, 'sendmail', X, etc. Your test doesn't isolate the cause of the memory leak. I think my message covered some suggestions to do that (like run with init=/bin/sh and run some tests from there).
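One non-invasive way to point the finger at a particular daemon is to sample its resident size over time and see whether it only ever grows. The sketch below is a heuristic, not a diagnosis: it assumes a /proc/<pid>/status file with a "VmRSS:" line, and the helper names and the 1024 kB threshold are arbitrary choices for illustration. A daemon can also grow for perfectly legitimate reasons.

```python
import os
import time


def read_rss_kb(pid):
    """Return the VmRSS value (kB) from /proc/<pid>/status, or None."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except (OSError, ValueError, IndexError):
        return None
    return None


def looks_like_leak(samples_kb, min_growth_kb=1024):
    """Heuristic: sizes never shrink and grow by at least min_growth_kb."""
    if len(samples_kb) < 3:
        return False  # too few samples to say anything
    never_shrinks = all(b >= a for a, b in zip(samples_kb, samples_kb[1:]))
    growth = samples_kb[-1] - samples_kb[0]
    return never_shrinks and growth >= min_growth_kb


if __name__ == "__main__":
    pid = os.getpid()  # replace with the PID of the suspect daemon
    samples = []
    for _ in range(5):
        rss = read_rss_kb(pid)
        if rss is not None:
            samples.append(rss)
        time.sleep(1)  # in practice, sample over hours, not seconds
    print("samples (kB):", samples)
    print("suspicious growth:", looks_like_leak(samples))
```

A process that goes up and comes back down (like the mail program and xv in your test) won't trip this check; one that ratchets upward all day will.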
If exiting doesn't return your memory to availability for cache/free space --- you have a problem in your kernel. However, it can be deceiving. For example --- I remember a situation where BIND ('named') was leaking --- and it looked like 'sendmail' was the culprit. In actuality 'sendmail' was making DNS queries to 'named', causing it to lose its cookies. (At the same time, 'sendmail' was segfaulting (dying a horrible death) because the old resolver libraries (against which it was linked) were returning lots of MX records for sites like Compuserve and AOL, which back then had just started deploying dozens of mail servers each --- so that one DNS request would return more records than the resolver could handle.)
At first I thought someone had discovered a new remote sendmail exploit and was hacking into my site (this was actually on an old SunOS box). Then I realized that it was related to DNS --- and finally I upgraded to a newer DNS and set of resolver libraries. The newer version of named still had a memory leak back then --- but my other sysadmin friends said "Oh yeah! It's been doing that --- just set up a 'cron' job to kill it once a day or so" (I'd been sure that it was my fault and that I'd built and installed it incorrectly).
As for the "true way" to look for memory leaks --- I think most programmers would disagree with your analysis on this one. They might suggest Electric Fence (a debugging form of the malloc() and new() calls that's designed to catch the sorts of allocation and reference problems that 'lint' won't --- and that might not be immediately fatal). Another option might be for someone to link this with Insure++ (http://www.linuxjournal.com/issue51/2951.html) and do their testing with that.
Certainly, we, as sysadmins, are usually constrained to more heuristic and less "invasive" approaches --- but we definitely want to isolate the problem to a specific component (program, module, kernel configuration, whatever) or combination. That's what "tech support" is all about.