There is no guarantee that your questions here will ever be answered. Readers at confidential sites must provide permission to publish. However, you can be published anonymously - just let us know!
From Mike Orr
Answered By Ben Okopnik
Just got a disturbing disk error. It was on my 486 laptop, which I've only used for reading and writing text files for the past few years because of its limited capacity (16 MB RAM, 512 K HD).
1) I was in vi, and it caught a SEGV. Fortunately, it was able to save its recovery file. I restarted vi, recovered the file, saved it, deleted the recovery file and went on typing. Then,
[Ben] Could be memory, could be HD...
2) I got an oops. Something about paging. I figured, common enough oops,
[Ben] Ah. This sounds like memory.
even though it's never happened on that computer, so I pulled out the power cable for a second and rebooted. (The battery had long ago stopped holding any charge.) Linux found that the HD had been mounted uncleanly (no duh) and started fsck. Fsck found two deleted files with zero dtime and fixed them. I was glad I had saved the file after recovering it since I'd deleted the recovery file. Then--
3) "Kernel panic: free list corrupted". I rebooted. Again the same error. What do you run when fsck doesn't work?? Is all my data gone bye-bye? Not that it was that much, and I was about to blast away the current (Debian) installation anyway and practice installing Rock Linux. (If, of course, the disk is good enough to be reformattable.)
4) A happy ending. I rebooted again to make sure I had the panic message right, and this time fsck completed and I got a login prompt. Quickly I tarred up my data and copied it onto a floppy.
I wonder if this will make Wacky Topic of the Month.
[Ben] Had that happen... oh, can't even remember now. Something crunchy happened, and required multiple fsck's. It would get a little further every time, and finally got it straightened out. IIRC, it took three or four reboots to get it - and I had exactly the same "if the salt have lost his savour, wherewith shall it be seasoned?" moment. Pretty scary to think that "fsck" doesn't work, just at the moment when it's the only thing that _can._ As far as I'm concerned, "fsck" should have a default "auto-restart" mode that can be interrupted with a 'Ctrl-C'; when it stops like that, the typical user's response isn't going to be "reboot and try again" - it's "Ohmygawd, MY MACHINE IS BROKEN!"
Doesn't fsck automatically restart sometimes? I know I've seen it do this, although the last time was early in the kernel 2.2 days. Is it an ex-feature? Or maybe Debian did it with a 'while' loop or something.
[Ben] Can't say. I've only had "fsck" run in 'repair mode' three times, all in the dim dark past; never saw it restart. I'm pretty sure all three were in, or before, the 2.0 days.
Of course, you can't interrupt an oops with a Ctrl-C. When an oops happens, the machine halts and must be reset.
[Ben] Hmm. Normal disk repair (fixing up inode dtimes and such) shouldn't produce an oops; theoretically, there is a large but fixed number of things that can be wrong, and there is supposed to be a programmatic response to each of them. The only reasons I could see for an oops to occur while "fsck" is running are 1) bad memory - which is an unrelated issue - or 2) the inode that contains "fsck" itself is damaged. Other than those, I can't see why a loop of the sort I suggested can't be written... really, I can't see ANY reason for "fsck" to freeze in the first place. It just sounds like some unaccounted-for cases that come up - and even that should be "catchable".
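The retry loop Ben has in mind might look something like the following Python sketch. It is only an illustration, not how any distribution's boot scripts actually work: the function names, the `max_passes` limit and the device path in the usage note are hypothetical, and `-y` is assumed to be passed through to an e2fsck-style checker that accepts it. The exit-status bits (1 = errors corrected, 4 = errors left uncorrected, 8 and above = fsck itself could not run or was cancelled) are the ones documented in the fsck(8) manual page. A loop like this can only help when fsck exits with errors; it cannot survive an oops or a kernel panic, since the machine halts.

```python
import subprocess

# Exit-status bits documented in the fsck(8) manual page.
FSCK_ERRORS_CORRECTED = 1
FSCK_ERRORS_UNCORRECTED = 4
FSCK_OPERATIONAL_ERROR = 8  # this value and above: fsck could not do its job


def run_fsck(device):
    """Run one repair pass and return fsck's exit status (needs root)."""
    # Assumption: the filesystem-specific checker accepts -y ("yes" to all repairs).
    return subprocess.run(["fsck", "-y", device]).returncode


def fsck_until_clean(device, run=run_fsck, max_passes=6):
    """Repeat fsck until a pass leaves no uncorrected errors.

    Returns the list of exit statuses, one per pass. Ctrl-C stops the loop.
    """
    statuses = []
    try:
        for _ in range(max_passes):
            status = run(device)
            statuses.append(status)
            if status >= FSCK_OPERATIONAL_ERROR:
                break  # retrying will not help if fsck cannot run at all
            if not status & FSCK_ERRORS_UNCORRECTED:
                break  # clean, or everything found was repaired
    except KeyboardInterrupt:
        pass  # the user interrupted; report the passes completed so far
    return statuses
```

Calling `fsck_until_clean("/dev/hypothetical-disk")` on an unmounted filesystem would keep re-running the check until a pass reports nothing left unrepaired, or give up after `max_passes` attempts. The `run` parameter exists so the loop can be exercised with a stand-in function instead of a real disk.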
Sorry, I wasn't thinking clearly. An oops is most likely bad memory, a bad disk or cosmic rays. A kernel panic (in my experience) is more likely to be a programming, configuration or environment issue. In either case, the machine halts and you can't recover except by resetting it. What is curious is, is there a certain moment during disk activity where a SEGV or oops would leave the filesystem in a "free list corrupted" state? Intuitively, there must be.
[Ben] Mmmm... sure. I'm not a kernel expert by any means, but if the machine crashes while the free list is being updated, that would make it corrupt. Not that it's really a big deal, the way it would be if individual inode pointers got fried - but it's certainly a much better mechanism than FAT, where a couple of K worth of mis-written data can fry your entire drive contents.
The next question is, is it possible to retrieve the data after such an error (short of running a sector-by-sector analysis)? Apparently it is, and fsck does it, although it takes a couple of runs to finish the repair.
[Ben] Sure; it would be an inode-by-inode analysis ("anything that's not a superblock, and is not owned by a file, and <a few other considerations that I can't think of at the moment> must be free space"), but a corrupted free list isn't that big of a thing. It's much easier to find out which blocks are really free, rather than trying to find which ones aren't _and_ how they're connected to the rest of the structure.
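Ben's rule of thumb ("not reserved, not owned by a file, so it must be free") can be shown with a toy model in Python. This is a sketch under heavy assumptions, not what fsck really does: the filesystem is reduced to a block count, a set of reserved block numbers and a dictionary mapping inode numbers to the blocks they own, and all the names are made up for the example. A real filesystem keeps this information in on-disk structures and has more cases to consider.

```python
def rebuild_free_list(total_blocks, reserved, inode_blocks):
    """Return the sorted list of blocks that nothing claims.

    total_blocks -- number of blocks on the (toy) disk, numbered from 0
    reserved     -- block numbers used by superblock-style metadata
    inode_blocks -- dict mapping inode number -> list of blocks it owns
    """
    used = set(reserved)
    for blocks in inode_blocks.values():
        used.update(blocks)
    # Whatever is neither reserved nor owned by a file must be free.
    return [b for b in range(total_blocks) if b not in used]


def free_list_is_consistent(free_list, total_blocks, reserved, inode_blocks):
    """Compare a stored free list against the one derived from the inodes."""
    expected = rebuild_free_list(total_blocks, reserved, inode_blocks)
    return sorted(free_list) == expected
```

With ten blocks, blocks 0 and 1 reserved, and two files owning blocks 2, 3 and 4, 7, the rebuilt free list is `[5, 6, 8, 9]`. A stored free list that disagrees with that (say, one that still lists block 7 because the machine went down mid-update) is simply thrown away and replaced, which is why this kind of damage is recoverable as long as the inodes themselves are intact.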
Too bad fsck can't somehow avoid causing a kernel panic or that the kernel can't figure out the situation enough to provide a more reassuring error message.
[Ben] Agreed. That kind of tool, the "fall back if all else fails" kind, should run flawlessly.
The worst fsck case Jim Dennis ever had required him to run fsck 6 times, but it did eventually succeed in cleaning up the mess he had made. (He had told his video controller to use the address range which the hard disk controller actually owned. Typos can be really bad for you at that level.) The moral here is, if at first fsck does not succeed, don't give up all hope. You may prefer to reformat afterwards anyway, but you should get a decent chance to rescue your important data first. -- Heather