The Answer Gang 101: Diagnosing a Linux crash

...making Linux just a little more fun!

By Jim Dennis, Ben Okopnik, Dan Wilder, Breen, Chris, and the Editors of Linux Gazette... and You!

Diagnosing a Linux crash

From Tom Brown

Answered By: Thomas Adam, Karl-Heinz Herrmann

OK guys, here's a n00b question for you that probably crosses over into Sys Admin territory.

What steps should someone follow after Linux crashes to figure out what went wrong?

Where do I start, and where do I look for clues?

Are all the logs found in /var/log, or are there others?

In what order should I look at the logs, and what should I look for?

[Thomas] It depends what you think went wrong. Essentially:

/var/log/messages

is where syslogd will dump all its data, and so is the best place to look. But there may well be application-specific data elsewhere in /var/log; XFree86.0.log is one such example.
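As a rough way to see which files under /var/log are actually being written on a given box, something like the following might help. It is only a sketch: the function name logs_by_mtime is made up for this example, and it assumes you can read the directory (some logs need root). Run soon after the reboot, the newest entries show which logs are worth opening first.

```python
import os
import time


def logs_by_mtime(log_dir="/var/log"):
    """Return (mtime, path) pairs for files under log_dir, newest first."""
    entries = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                entries.append((os.path.getmtime(path), path))
            except OSError:
                # The file vanished or cannot be examined; skip it.
                continue
    return sorted(entries, reverse=True)


if __name__ == "__main__":
    # Show the ten most recently written logs with a readable timestamp.
    for mtime, path in logs_by_mtime()[:10]:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
        print(stamp, path)
```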

Any pro-active steps I should be taking to get more info, should it happen again?

The specifics of my case: my file server (a 750 MHz Athlon running SuSE 9) simply locked up, and I couldn't get anything to display (GUI or command line). I knew the machine was in trouble when it didn't respond to pings. I had to hit the reset button to get it back (and deal with fsck, naturally). Funny thing is, the system clock reset itself to 28 minutes after midnight (when it should have read the middle of the afternoon), but didn't lose the date. Odd, that. The machine's been running 24/7 for about three weeks now (I set it up around then), with no sign of problems until now.

[Thomas] This might be framebuffer related. At the lilo/grub prompt, type:

linux video=vga16:off

[Thomas] to see if that has any effect.

There have been snippets of these effects mentioned in the past. The one that springs to mind is:

http://linuxgazette.net/issue74/tag/9.html
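To confirm after booting that an option typed at the boot prompt really reached the kernel, you can look at /proc/cmdline. Here is a small sketch of that check; parse_cmdline is a name invented for this example, and it ignores quoting, which is good enough for a simple option like the one above.

```python
def parse_cmdline(text):
    """Split a kernel command line into a dict; bare words map to None."""
    options = {}
    for word in text.split():
        key, sep, value = word.partition("=")
        options[key] = value if sep else None
    return options


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            options = parse_cmdline(handle.read())
    except OSError:
        # No /proc/cmdline here (not Linux, or /proc is not mounted).
        options = {}
    print("video =", options.get("video", "(not set)"))
```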

[K.-H] There are ways of still getting kernel info (pro active steps):

Plug an old printer into the lpX port and declare it the system console (kernel compile parameter, and I don't know how exactly you activate it -- maybe inittab).
While the machine is running, switch to the system console (Alt-Ctrl-F10 on SuSE) and leave it there. It might show a kernel oops/panic there at the next crash.
Search the SuSE config for the Magic SysRq keys -- the function should be compiled into the kernel but has to be activated. Then you can press weird key combinations like Alt-SysRq-P for a register dump, Alt-SysRq-S for a disk sync, ... see /usr/src/linux/Documentation/sysrq.txt for details.
File server? What hardware? I had SCSI disks locking up my system for various reasons (tagged queuing incompatibilities of individual drives, cables that were too long, ...).
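On kernels built with the Magic SysRq option, the switch lives in /proc/sys/kernel/sysrq: 0 means off, 1 means on. The sketch below reads it and, given root, turns it on. The function names are invented for this example, and the change lasts only until the next reboot; making it permanent is a job for your distribution's configuration.

```python
from pathlib import Path

SYSRQ = Path("/proc/sys/kernel/sysrq")


def sysrq_value(path=SYSRQ):
    """Return the current SysRq setting as an integer (0 means disabled)."""
    return int(Path(path).read_text().strip())


def enable_sysrq(path=SYSRQ):
    """Write 1 to the SysRq switch; needs root on a real system."""
    Path(path).write_text("1\n")


if __name__ == "__main__":
    try:
        print("sysrq =", sysrq_value())
    except (OSError, ValueError):
        # Kernel built without Magic SysRq, or /proc is not available.
        print("cannot read", SYSRQ)
```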

I'm going to keep your response handy -- several things to try. Meantime, I realized I was booting the thing into runlevel 5 (rather stupid, actually), so I've since changed it to 3. If it is, as someone suggested, a framebuffer problem, maybe that will solve it for now. I'm using a real old Voodoo 3 card I scrounged from my parts bin. If it happens again, I'll have to tear the machine apart and start playing with the memory, as someone else here suggested.

Installing and configuring Linux is one thing. Learning how to do an autopsy seems to be quite another!

[Thomas] That's because generally one doesn't do it quite like that. Problem diagnosis is situation-dependent. In any given situation there is often a small set of files and related information that you can analyse without having to worry about the rest of the system.

Granted, this is related to how much information one is told at the time (if you've been on this list for as long as I have, you'll come to realise that usually we don't get any), and whether or not the person has tried to remedy it.

In general, though, poking around, taking one aspect of your system and looking at what it does and how, is all helpful when you come to diagnose anything.

Yes, well, I looked at the messages log, but saw only a gap time-wise between cron processing around 4 in the morning, and the time of the crash. I'm not sure which of the other logs are important in that case. Where do I find the register dump (although I suspect it won't make much sense to me, rather like those register dumps you get in Windows XP)?

[Thomas] Syslogd might have logged it if the problem was software-related, and if the program in question produced any errors. If it was hardware, then it might not have, depending on the severity of the failure.
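Spotting a silent stretch like that by eye gets tedious in a big log, so here is a sketch that reports every gap longer than a threshold. It assumes the traditional syslog line format, where the first 15 characters are a timestamp such as "Apr  3 04:15:01" with no year, so you have to supply the year yourself. The names and the two sample lines are made up for illustration; on a real box you would feed it the lines of /var/log/messages.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year):
    """Parse the 15-character timestamp that opens a traditional syslog line."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def find_gaps(lines, year, threshold=timedelta(hours=1)):
    """Return (before, after) timestamp pairs separated by at least threshold."""
    gaps = []
    previous = None
    for line in lines:
        try:
            current = parse_stamp(line, year)
        except ValueError:
            # Not a timestamped line (continuation, garbage); ignore it.
            continue
        if previous is not None and current - previous >= threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


if __name__ == "__main__":
    # Invented sample lines; replace with open("/var/log/messages").
    sample = [
        "Apr  3 04:15:01 server cron[1234]: (root) CMD (run-crons)",
        "Apr  3 15:42:10 server syslogd: restart.",
    ]
    for before, after in find_gaps(sample, 2004):
        print("nothing logged between", before, "and", after)
```

The last line before a gap is usually the most interesting one, since it is the final thing syslogd managed to get onto the disk.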

I'm using a real old Voodoo 3 card I scrounged from my parts bin. If it happens again, I'll have to tear the machine apart and start playing with the memory, as someone else here suggested.

[Thomas] It might be memory, but as the link I gave you last time around said, memory problems tend to be more 'visible' in the sense that you get a lot of applications SEGFAULTing and SIGABRTing for no apparent reason. In such instances, installing and running 'memtest86' is usually of help.

[K.-H] Most of the time I had the great luck of oopses and kernel crashes occurring in the SCSI layer, often because of hardware problems. If the SCSI layer is in trouble, nothing will get written to disk. What's software-related regarding the kernel? The kernel deals with hardware, and it's supposed to handle error conditions gracefully, i.e. not just freeze without a hint of what's gone wrong. But there are situations where the kernel doesn't have a chance of leaving hints on the hard drive.

Then a few things might be useful (to Tom B):

Run the box without X, as you suggested yourself.
Switch to console 10 (sys messages). Even if the kernel isn't able to leave a trace on the HD, it might give a hint here.
Any reg dump on console 10 or in syslog needs to be run through ksymoops to make it useful. That's something nobody else can do for you, because it has to be done on your system, with your kernel and kernel symbols. I hope SuSE set up everything needed correctly. As you mentioned WinXX reg dumps -- raw, they are about as useful in Linux as in WinXX, but Linux has the tools (ksymoops) to decode them.
If you gain any information (and yes, you will have to note it down on paper and give it to ksymoops after the reboot), you can try here too, or with the kernel people.
This is an option to follow if you are interested in why your system crashes. As it's crashing very irregularly, this is a rather difficult situation and a very slow process. But "the machine was dead the next morning" won't help you the next time it happens. The things mentioned above (along with running a printer or a serial line console) would also help in getting the syslog right up to the crash.
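Once the oops text has been typed into a file after the reboot, the decoding step could be wrapped up roughly as follows. This is a sketch under assumptions: it relies on ksymoops accepting a System.map through its -m option and reading the oops from standard input, so check the man page on your system, and the /boot/System.map path is only a placeholder for wherever the map of the kernel that crashed lives.

```python
import shutil
import subprocess


def ksymoops_command(system_map):
    """Build the ksymoops argument list for a given System.map file."""
    return ["ksymoops", "-m", system_map]


def decode_oops(oops_text, system_map):
    """Feed oops_text to ksymoops on stdin; return its output, or None if it is not installed."""
    if shutil.which("ksymoops") is None:
        return None
    result = subprocess.run(
        ksymoops_command(system_map),
        input=oops_text,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder path; use the map that matches the crashed kernel.
    print(" ".join(ksymoops_command("/boot/System.map")))
```

The System.map has to match the kernel that produced the oops, otherwise the decoded symbols will be wrong.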

suggested reading:

/usr/src/linux/Documentation/oops-tracing.txt

ksymoops man page

But I have to say that often enough I don't try to hunt down the spurious crashes which do occasionally happen, whether the cause is hardware or whatever. You can always try a different kernel or simply hope for the best.

Still -- keeping the system on console 10 is not a difficult thing to do, and it just might give you something useful next time (note it down for ksymoops if it's an oops or panic).

SuSE has memtest as a boot option -- run it if you suspect the RAM, run it long (several passes) and the full test suite if you don't find any errors on the first go.

Thanks, Karl and Thomas. This is the starting point I needed. (For one thing, I didn't even know about console 10: looks helpful.) I just wish I had more from the crash than just a black screen, but that's what I get for running X on bootup for a file server. Between the two of you, I think I have the answers I was looking for when I started this thread: not what went wrong exactly, but how to dig in and try to figure it out for myself.

Oh, Thomas, when I rebooted to runlevel 3, I entered that video setting you suggested as well.

I just know I'll be back with more questions, though. One way or another, I'll figure this Linux thing out.

Thanks again, guys. Your help, as always, is much appreciated.

HTML script maintained by Heather Stern of Starshine Technical Services, http://www.starshine.org/

This page edited and maintained by the Editors of Linux Gazette Copyright © its authors, 2004 Published in issue 101 of Linux Gazette April 2004