Tux

...making Linux just a little more fun!

Diagnosing SATA problems

Neil Youngman [ny at youngman.org.uk]
Mon, 22 Jan 2007 21:23:58 +0000

I've been having problems with my SATA disk for some time and I've moved back to working off my old IDE disk, while I investigate the problem. I'm assuming the problem is hardware, but I don't have any suitable hardware to swap around to prove the point. I think my next step is to buy another SATA controller and swap that, but first I thought I'd see if the gang's collective wisdom had any pointers to offer.

First off, here's an extract from /var/log/messages as I try to copy a 1.4GB file onto the SATA disk.

Jan 22 15:52:57 tsr2 kernel: ata2: hard resetting port
Jan 22 15:52:57 tsr2 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 22 15:52:57 tsr2 kernel: ata2.00: configured for UDMA/100
Jan 22 15:52:57 tsr2 kernel: ata2: EH complete
Jan 22 15:52:57 tsr2 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
Jan 22 15:52:57 tsr2 kernel: sda: Write Protect is off
Jan 22 15:52:57 tsr2 kernel: SCSI device sda: drive cache: write back
Jan 22 15:52:58 tsr2 kernel: ata2: hard resetting port
Jan 22 15:52:59 tsr2 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 22 15:52:59 tsr2 kernel: ata2.00: configured for UDMA/100
Jan 22 15:52:59 tsr2 kernel: ata2: EH complete
Jan 22 15:52:59 tsr2 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
Jan 22 15:52:59 tsr2 kernel: sda: Write Protect is off
Jan 22 15:52:59 tsr2 kernel: SCSI device sda: drive cache: write back
Jan 22 15:52:59 tsr2 kernel: ata2: hard resetting port
Jan 22 15:52:59 tsr2 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 22 15:52:59 tsr2 kernel: ata2.00: configured for UDMA/100
Jan 22 15:52:59 tsr2 kernel: ata2: EH complete
Jan 22 15:52:59 tsr2 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
Jan 22 15:52:59 tsr2 kernel: sda: Write Protect is off
Jan 22 15:52:59 tsr2 kernel: SCSI device sda: drive cache: write back
Jan 22 15:53:01 tsr2 kernel: ata2.00: limiting speed to UDMA/66
Jan 22 15:53:01 tsr2 kernel: ata2: hard resetting port
Jan 22 15:53:02 tsr2 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 22 15:53:02 tsr2 kernel: ata2.00: configured for UDMA/66
Jan 22 15:53:02 tsr2 kernel: ata2: EH complete
Jan 22 15:53:02 tsr2 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
Jan 22 15:53:02 tsr2 kernel: sda: Write Protect is off
Jan 22 15:53:02 tsr2 kernel: SCSI device sda: drive cache: write back
Jan 22 15:53:04 tsr2 kernel: ata2.00: limiting speed to UDMA/44
Jan 22 15:53:04 tsr2 kernel: ata2: hard resetting port
Jan 22 15:53:05 tsr2 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 22 15:53:05 tsr2 kernel: ata2.00: configured for UDMA/44
Jan 22 15:53:05 tsr2 kernel: ata2: EH complete
Jan 22 15:53:05 tsr2 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
Jan 22 15:53:05 tsr2 kernel: sda: Write Protect is off
Jan 22 15:53:05 tsr2 kernel: SCSI device sda: drive cache: write back
Jan 22 15:53:09 tsr2 kernel: ata2.00: limiting speed to UDMA/33
Jan 22 15:53:09 tsr2 kernel: ata2: hard resetting port
Jan 22 15:53:10 tsr2 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 22 15:53:10 tsr2 kernel: ata2.00: configured for UDMA/33
Jan 22 15:53:10 tsr2 kernel: ata2: EH complete
Jan 22 15:53:10 tsr2 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
Jan 22 15:53:10 tsr2 kernel: sda: Write Protect is off
Jan 22 15:53:10 tsr2 kernel: SCSI device sda: drive cache: write back
Jan 22 15:53:11 tsr2 kernel: ata2.00: limiting speed to UDMA/25
Jan 22 15:53:11 tsr2 kernel: ata2: hard resetting port
Jan 22 15:53:12 tsr2 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 22 15:53:12 tsr2 kernel: ata2.00: configured for UDMA/25
Jan 22 15:53:12 tsr2 kernel: ata2: EH complete
Jan 22 15:53:12 tsr2 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
Jan 22 15:53:12 tsr2 kernel: sda: Write Protect is off
Jan 22 15:53:12 tsr2 kernel: SCSI device sda: drive cache: write back
Jan 22 15:53:16 tsr2 kernel: ata2.00: limiting speed to UDMA/16
Jan 22 15:53:16 tsr2 kernel: ata2: hard resetting port
Jan 22 15:53:17 tsr2 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 22 15:53:17 tsr2 kernel: ata2.00: configured for UDMA/16
Jan 22 15:53:17 tsr2 kernel: ata2: EH complete
Jan 22 15:53:17 tsr2 kernel: SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
Jan 22 15:53:17 tsr2 kernel: sda: Write Protect is off
Jan 22 15:53:17 tsr2 kernel: SCSI device sda: drive cache: write back
Jan 22 15:53:18 tsr2 kernel: ata2.00: limiting speed to PIO4
Jan 22 15:53:18 tsr2 kernel: ata2: hard resetting port
Jan 22 15:53:19 tsr2 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 22 15:53:19 tsr2 kernel: ata2.00: configured for PIO4
Jan 22 15:53:19 tsr2 kernel: ata2: EH complete
At this point 46MB has been copied and the machine is effectively hung.

Once I realised I had a problem, i naturally installed smartmontools and this is what smartctl tells me.

# smartctl -d ata -l selftest /dev/sda
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
 
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  
LBA_of_first_error
# 1  Short offline       Completed without error       00%      1685         -
# 2  Short offline       Completed without error       00%      1671         -
# 3  Short offline       Completed without error       00%      1667         -
# 4  Extended offline    Completed without error       00%      1661         -
# 5  Short offline       Completed without error       00%      1643         -
# 6  Short offline       Completed without error       00%      1628         -
# 7  Short offline       Completed without error       00%      1613         -
# 8  Short offline       Completed without error       00%      1598         -
# 9  Short offline       Completed without error       00%      1583         -
#10  Short offline       Completed without error       00%      1569         -
#11  Short offline       Completed without error       00%      1555         -
#12  Extended offline    Completed without error       00%      1549         -
#13  Short offline       Completed without error       00%      1538         -
#14  Extended offline    Completed without error       00%      1523         -
 
# smartctl -d ata -l error /dev/sda
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
 
=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged
 
# 
The lack of any errors suggests to me that the problem is not with the disk, hence the thought that I should replace the controller. Is this a reasonable conclusion from the data available?

I have tried reseating the controller card and cables and moved the SATA cable to the secondary port on the SATA controller.

Is there anything else I should be trying?

Neil Youngman


Top    Back


Benjamin A. Okopnik [ben at linuxgazette.net]
Tue, 23 Jan 2007 11:46:36 -0500

On Mon, Jan 22, 2007 at 09:23:58PM +0000, Neil Youngman wrote:

> 
> # smartctl -d ata -l selftest /dev/sda
> smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
> 
> === START OF READ SMART DATA SECTION ===
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  
> LBA_of_first_error
> # 1  Short offline       Completed without error       00%      1685         -
> # 2  Short offline       Completed without error       00%      1671         -
> # 3  Short offline       Completed without error       00%      1667         -
> # 4  Extended offline    Completed without error       00%      1661         -
> # 5  Short offline       Completed without error       00%      1643         -
> # 6  Short offline       Completed without error       00%      1628         -
> # 7  Short offline       Completed without error       00%      1613         -
> # 8  Short offline       Completed without error       00%      1598         -
> # 9  Short offline       Completed without error       00%      1583         -
> #10  Short offline       Completed without error       00%      1569         -
> #11  Short offline       Completed without error       00%      1555         -
> #12  Extended offline    Completed without error       00%      1549         -
> #13  Short offline       Completed without error       00%      1538         -
> #14  Extended offline    Completed without error       00%      1523         -
> 
> # smartctl -d ata -l error /dev/sda
> smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
> 
> === START OF READ SMART DATA SECTION ===
> SMART Error Log Version: 1
> No Errors Logged
> 
> # 
> 
> The lack of any errors suggests to me that the problem is not with the disk, 
> hence the thought that I should replace the controller. Is this a reasonable 
> conclusion from the data available?
> 
> I have tried reseating the controller card and cables and moved the SATA cable 
> to the secondary port on the SATA controller. 
> 
> Is there anything else I should be trying?

I'm wondering what would happen if you ran the above test in a logged loop while loading the disk - say, by running a large copy operation in another terminal. You'd build up a pretty good sized logfile after a while, but you might get some failure info that might point the way to a solution.

Coming at it from the hardware end, I'd say that you have the right idea: throwing in a different controller would be a pretty good test. Shotgunning does make sense as a troubleshooting technique when the possible number of affected parts is low.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *

Top    Back