
...making Linux just a little more fun!

Diagnosing SATA problems

Neil Youngman [ny at youngman.org.uk]


Sun, 1 Apr 2007 19:10:45 +0100

On or around Tuesday 23 January 2007 16:46, Benjamin A. Okopnik reorganised a bunch of electrons to form the message:

> On Mon, Jan 22, 2007 at 09:23:58PM +0000, Neil Youngman wrote:
> >
> > The lack of any errors suggests to me that the problem is not with the
> > disk; hence the thought that I should replace the controller. Is this a
> > reasonable conclusion from the data available?
> >
> > I have tried reseating the controller card and cables and moved the SATA
> > cable to the secondary port on the SATA controller.
> >
> > Is there anything else I should be trying?
>
> Coming at it from the hardware end, I'd say that you have the right
> idea: throwing in a different controller would be a pretty good test.
> Shotgunning does make sense as a troubleshooting technique, when the
> possible number of affected parts is low.

I've finally got round to putting in a new SATA controller, and it seems to have helped; it may even have solved the problem. Previously, trying to copy large amounts of data to the SATA disk would bring the system to a complete halt. The last time I tried to copy a 1.5GB file to the SATA disk, it died after 60MB. Now, I can copy the same file without any obvious problems.

There are still SATA errors on boot, which are a cause for concern, so I'll want to run for a while with that disk as my main disk before I'm totally confident.

The errors on boot look like

  ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
  ata2.00: tag 0 cmd 0xb0 Emask 0x2 stat 0x50 err 0x0 (HSM violation)
  ata2: soft resetting port
  ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
  ata2.00: configured for UDMA/133
  ata2: EH complete

and that repeats a dozen times. It doesn't mean much to me, so I guess I'll need to spend some time with Google.
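
For anyone who wants to see how often the sequence repeats without counting by hand, here is a minimal Python sketch that tallies the exception lines and the failed-command lines per port. It assumes the exact line layout quoted in this message (other kernel versions may word these messages differently), and the names `summarise` and `SAMPLE` are made up for illustration; in practice the text would come from the output of dmesg.

```python
import re
from collections import Counter

# One copy of the sequence quoted in the message, used as sample input.
SAMPLE = """\
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata2.00: tag 0 cmd 0xb0 Emask 0x2 stat 0x50 err 0x0 (HSM violation)
ata2: soft resetting port
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2.00: configured for UDMA/133
ata2: EH complete
"""

# Patterns are an assumption based only on the lines shown above.
# re.search is used so that a leading dmesg timestamp does not matter.
EXCEPTION_RE = re.compile(r"(ata\d+(?:\.\d+)?): exception Emask")
FAILED_CMD_RE = re.compile(
    r"(ata\d+(?:\.\d+)?): tag \d+ cmd (0x[0-9a-f]+) .*\(([^)]+)\)\s*$"
)


def summarise(log_text):
    """Count exceptions per device and (device, cmd, reason) per failed command."""
    exceptions = Counter()
    failed = Counter()
    for line in log_text.splitlines():
        m = EXCEPTION_RE.search(line)
        if m:
            exceptions[m.group(1)] += 1
            continue
        m = FAILED_CMD_RE.search(line)
        if m:
            # cmd 0xb0 is the ATA SMART opcode; the text in parentheses
            # is the reason the kernel reports for the failure.
            failed[(m.group(1), m.group(2), m.group(3))] += 1
    return exceptions, failed
```

Feeding it twelve copies of the sequence, as in `summarise(SAMPLE * 12)`, gives twelve exceptions on ata2.00, each paired with a failed cmd 0xb0 and the reason "HSM violation".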

Neil




René Pfeiffer [lynx at luchs.at]


Sun, 1 Apr 2007 21:11:10 +0200

On Apr 01, 2007 at 1910 +0100, Neil Youngman appeared and said:

> On or around Tuesday 23 January 2007 16:46, Benjamin A. Okopnik reorganised a
> bunch of electrons to form the message:
> > [...]
> > Coming at it from the hardware end, I'd say that you have the right
> > idea: throwing in a different controller would be a pretty good test.
> > Shotgunning does make sense as a troubleshooting technique, when the
> > possible number of affected parts is low.
>
> I've finally got round to putting in a new SATA controller, and it seems to
> have helped; it may even have solved the problem. Previously, trying to copy
> large amounts of data to the SATA disk would bring the system to a complete
> halt. The last time I tried to copy a 1.5GB file to the SATA disk, it died
> after 60MB. Now, I can copy the same file without any obvious problems.
>
> There are still SATA errors on boot, which are a cause for concern, so I'll
> want to run for a while with that disk as my main disk before I'm totally
> confident. [...]

Do you happen to have a Western Digital disk in your computer? A couple of days ago, I stumbled upon the description of a server outage and its story. They refer to a bug in the firmware of Western Digital drives:

http://www.voip-info.org/wiki/view/The+Great+Ides+of+March+VOIP-Info+Outage

[[[ I've changed René's original URL for this (from the Western Digital site) to a tinyurl, for convenience. -- Kat ]]]

http://preview.tinyurl.com/yrusj3

Best wishes, René.




Neil Youngman [ny at youngman.org.uk]


Sun, 1 Apr 2007 20:38:24 +0100

On or around Sunday 01 April 2007 20:11, René Pfeiffer reorganised a bunch of electrons to form the message:

> On Apr 01, 2007 at 1910 +0100, Neil Youngman appeared and said:
> > There are still SATA errors on boot, which are a cause for concern, so
> > I'll want to run for a while with that disk as my main disk before I'm
> > totally confident. [...]
>
> Do you happen to have a Western Digital disk in your computer? A couple
> of days ago, I stumbled upon the description of a server outage and its
> story. They refer to a bug in the firmware of Western Digital
> drives:

Interesting. Yes, it is a Western Digital disk.

  Vendor: ATA       Model: WDC WD2000JD-22H  Rev: 08.0
  Type:   Direct-Access                      ANSI SCSI revision: 05

That firmware bug appears to apply to WD1600YS, WD2500YS, WD4000YS, and WD5000YS and it isn't one of those, but I need to have a look around and see if there's a firmware update for this drive.
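
The comparison against the affected list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list holds just the four models named in this thread, the helper names `model_from_inquiry` and `is_affected` are hypothetical, and the parsing assumes the "Vendor: ... Model: ... Rev: ..." layout quoted in this message.

```python
import re

# The four models named in this thread; not an authoritative list.
AFFECTED = ("WD1600YS", "WD2500YS", "WD4000YS", "WD5000YS")


def model_from_inquiry(line):
    """Pull the model string out of a 'Vendor: ... Model: ... Rev: ...' line."""
    m = re.search(r"Model:\s*(.*?)\s+Rev:", line)
    return m.group(1) if m else None


def is_affected(model):
    """True if any word of the model string, minus its '-suffix', is listed."""
    for word in model.upper().split():
        base = word.split("-", 1)[0]  # "WD2000JD-22H" -> "WD2000JD"
        if base in AFFECTED:
            return True
    return False


line = "  Vendor: ATA       Model: WDC WD2000JD-22H  Rev: 08.0"
model = model_from_inquiry(line)
print(model, "affected" if is_affected(model) else "not in the affected list")
```

For the drive quoted here the base model is WD2000JD, so it is reported as not in the list. That says nothing about whether some other firmware issue applies to it.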

Thanks

Neil




Martin J Hooper [martinjh at blueyonder.co.uk]


Sun, 01 Apr 2007 20:52:52 +0100

Neil Youngman wrote:

> That firmware bug appears to apply to WD1600YS, WD2500YS,
> WD4000YS, and WD5000YS and it isn't one of those, but I need
> to have a look around and see if there's a firmware update for
> this drive.

Didn't know you could flash HD firmware... I know you can flash BIOS and DVD drives. I guess it's the same way of doing it, yes?




René Pfeiffer [lynx at luchs.at]


Sun, 1 Apr 2007 23:14:47 +0200

On Apr 01, 2007 at 2052 +0100, Martin J Hooper appeared and said:

> Neil Youngman wrote:
> > That firmware bug appears to apply to WD1600YS, WD2500YS,
> > WD4000YS, and WD5000YS and it isn't one of those, but I need
> > to have a look around and see if there's a firmware update for
> > this drive.
>
> Didn't know you could flash HD firmware...  I know you can flash
> BIOS and DVD drives. I guess it's the same way of doing it, yes?

A lot of things come with firmware these days. For me, this trend is a bit disturbing, since most firmware is proprietary and rather inaccessible for inspection or bug fixing. I've seen firmware images for routers, VoIP phones, cell phones, DVD-/CD burners, DVD-/CD-ROMs, DVD players, satellite equipment, modems, all kinds of I/O controllers, the various BIOS images of mainboards, and now we have HDs.

Usually, the vendor of the hardware offers a binary that can be used to update the firmware. This binary is often tailored for a DOS variant or MS Windows. In the case of networked equipment, you can sometimes use DHCP and TFTP or a similar construct.

Best, René.




Neil Youngman [ny at youngman.org.uk]


Mon, 2 Apr 2007 21:23:52 +0100

On or around Sunday 01 April 2007 22:14, René Pfeiffer reorganised a bunch of electrons to form the message:

>
> A lot of things come with firmware these days. For me, this trend is a
> bit disturbing, since most firmware is proprietary and rather
> inaccessible for inspection or bug fixing. I've seen firmware images for
> routers, VoIP phones, cell phones, DVD-/CD burners, DVD-/CD-ROMs, DVD
> players, satellite equipment, modems, all kinds of I/O controllers, the
> various BIOS images of mainboards, and now we have HDs.

I find the fact that most things come with firmware neither surprising nor worrying in itself. I believe the alternative to firmware in peripheral devices of all kinds is hard-coded logic, which turns a bug into a permanent feature.

While I concur with your preference for free software, the use of firmware at least allows for updates. While the firmware may not be free, there is at least the possibility of replacing it with free software.

Neil

