# ReiserFS and 2TB disk

## binro

Two weeks ago I upgraded two ageing disks to a single 2TB Seagate (ST2000DM001-9YN164). I use LVM and formatted the LVs with ReiserFS. In particular, the /home partition is 1TB. Having restored my system, everything looked fine, but when I returned after several hours the KDE desktop would not wake up properly. Switching to a console, neither the sync nor the umount command would complete; they just hung. This happened a couple of times, so I thought the backup might have been a bit corrupt and completely reinstalled @system and @world, and built the latest kernel, 3.6.8. Returning last night, the same thing had occurred; looking at htop from a console I could see lots of identical processes that had been started and had just hung. In the syslog I could see kernel messages relating to hung tasks:

Dec  7 00:39:49 opal kernel: INFO: task apache2:28115 blocked for more than 120 seconds.

Dec  7 00:39:49 opal kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Dec  7 00:39:49 opal kernel: apache2         D 0000000000000000     0 28115  10408 0x00000000

Dec  7 00:39:49 opal kernel: ffff88012f27d980 0000000000000086 0000000000000002 ffffffff815b7420

Dec  7 00:39:49 opal kernel: 0000000000011280 ffff8800b6727fd8 0000000000011280 ffff8800b6726010

Dec  7 00:39:49 opal kernel: ffff8800b6727fd8 0000000000011280 ffff88012f27d980 0000000000011280

Dec  7 00:39:49 opal kernel: Call Trace:

Dec  7 00:39:49 opal kernel: [<ffffffff8106ac92>] ? load_balance+0x102/0x790

Dec  7 00:39:49 opal kernel: [<ffffffff8107a609>] ? debug_mutex_add_waiter+0x29/0x70

Dec  7 00:39:49 opal kernel: [<ffffffff814312cf>] ? __mutex_lock_slowpath+0x22f/0x310

Dec  7 00:39:49 opal kernel: [<ffffffff8102c455>] ? default_spin_lock_flags+0x5/0x10

Dec  7 00:39:49 opal kernel: [<ffffffff8143401b>] ? _raw_spin_lock_irqsave+0x3b/0x60

Dec  7 00:39:49 opal kernel: [<ffffffff8118cd81>] ? queue_log_writer+0x91/0xe0

Dec  7 00:39:49 opal kernel: [<ffffffff81066a80>] ? try_to_wake_up+0x2b0/0x2b0

Dec  7 00:39:49 opal kernel: [<ffffffff81192a18>] ? do_journal_begin_r+0x238/0x380

Dec  7 00:39:49 opal kernel: [<ffffffff81192bef>] ? journal_begin+0x8f/0x170

Dec  7 00:39:49 opal kernel: [<ffffffff81173e49>] ? reiserfs_create+0xf9/0x260

Dec  7 00:39:49 opal kernel: [<ffffffff8110ab1f>] ? generic_permission+0xff/0x240

Dec  7 00:39:49 opal kernel: [<ffffffff8110ce29>] ? vfs_create+0xb9/0x110

Dec  7 00:39:49 opal kernel: [<ffffffff8110e1c2>] ? do_last+0x9b2/0xe70

Dec  7 00:39:49 opal kernel: [<ffffffff810c57b0>] ? release_pages+0x180/0x1d0

Dec  7 00:39:49 opal kernel: [<ffffffff8110e741>] ? path_openat+0xc1/0x500

Dec  7 00:39:49 opal kernel: [<ffffffff8110ecad>] ? do_filp_open+0x4d/0xc0

Dec  7 00:39:49 opal kernel: [<ffffffff81433cf5>] ? _raw_spin_unlock+0x15/0x40

Dec  7 00:39:49 opal kernel: [<ffffffff8111b686>] ? alloc_fd+0x106/0x130

Dec  7 00:39:49 opal kernel: [<ffffffff810fd2e8>] ? do_sys_open+0x108/0x1f0

Dec  7 00:39:49 opal kernel: [<ffffffff81434a39>] ? system_call_fastpath+0x16/0x1b

Eventually the system just hangs completely. Since this started with the new disk, I am wondering if ReiserFS actually works with new, huge disks. If not, what else could be causing this? This is a bit desperate.  :Sad: 

TIA

----------

## Merlin-TC

Sawadee Binro,

ReiserFS doesn't have any problems with volumes up to 16 TB, so I doubt ReiserFS itself is the problem.

1. Is there any additional output of dmesg?

2. Can you reproduce it or does it feel "random"?

3. Is the system under heavy load when this is happening?

You could try another I/O scheduler just to narrow down the problem.
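As a rough sketch of how to see which scheduler a disk is using: Linux lists the available schedulers in `/sys/block/<device>/queue/scheduler` and puts the active one in square brackets. The device name `sda` and the function names below are only illustrative.

```python
from pathlib import Path


def parse_scheduler(line):
    """Split a sysfs scheduler line into (active, available)."""
    active = None
    available = []
    for name in line.split():
        # The active scheduler is the entry wrapped in [brackets].
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def read_scheduler(device="sda"):
    """Read and parse the scheduler line for one block device."""
    # "sda" is only an example device name.
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_scheduler(path.read_text())
```

To switch, write one of the listed names to the same file as root; the change lasts until the next reboot.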

----------

## srs5694

You might also run a SMART utility like GSmartControl, the SMART functions of Palimpsest, or smartctl. These will tell you if you've got a new disk that's defective. (Sadly, it happens sometimes.) The output can be difficult to interpret sometimes, though, so post for help interpreting the output if you need it.

----------

## binro

 *Merlin-TC wrote:*   

> Sawadee Binro,
> 
> ReiserFS doesn't have any problems with volumes up to 16 TB, so I doubt ReiserFS itself is the problem.
> 
> 1. Is there any additional output of dmesg?
> ...

 

I examined the syslog and everything looks normal; there is no unusual load. It is not random, but inevitable. I am beginning to suspect it is caused by the graphics, the nvidia driver or KDE in some way: the system is stable if I don't log on. But this never happened before I changed the disk.

Khawp khun khrup!

----------

## binro

 *srs5694 wrote:*   

> You might also run a SMART utility like GSmartControl, the SMART functions of Palimpsest, or smartctl. These will tell you if you've got a new disk that's defective. (Sadly, it happens sometimes.) The output can be difficult to interpret sometimes, though, so post for help interpreting the output if you need it.

 

The smartd daemon is running and reports the disk to be entirely healthy!

----------

## binro

This gets stranger and stranger. I disabled the screen-saver and now the system is stable again! A screen-saver wouldn't interfere with process execution, would it?

----------

## srs5694

 *binro wrote:*   

> This gets stranger and stranger. I disabled the screen-saver and now the system is stable again! A screen-saver wouldn't interfere with process execution, would it?

 

It might, especially if it uses an advanced video feature and if that feature has a buggy implementation in a video driver. Video drivers are increasingly relying on kernel-level code, and then all bets are off; a buggy kernel driver could interfere with just about anything.

Thus, you might try upgrading your video driver, if possible, or switch drivers (from Nvidia's proprietary driver to nouveau or vice-versa, for instance). If that's too much hassle or otherwise impractical, try adjusting your screen saver to use just one module that does the simplest thing possible -- ideally just blank the screen. You'll do without the eye candy that way, but that's better than having a system that hangs randomly.

----------

## binro

I was thinking along the same lines, except that before the restore onto the new disk this all worked perfectly. I can't help thinking that something in my system has been subtly corrupted.

----------

## srs5694

How did you transfer your system to the new disks? (dd, tar, etc.?) It could be there's a malfunction in the video drivers that's related to a subtle permission problem introduced in the transfer; or maybe a bit or two got flipped during the copying. If you've still got the original disk, you could plug it in and write a script to compare every file between the two systems.
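Such a comparison script could look roughly like this. It is only a sketch: the function name and the idea of mounting the old system somewhere like `/mnt/old` are assumptions, and it checks regular files only (contents and permission bits), skipping symlinks and device nodes.

```python
import filecmp
import os


def diff_trees(old_root, new_root):
    """Return (relative path, problem) for files that are missing or differ."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(old_root):
        for name in filenames:
            old_path = os.path.join(dirpath, name)
            if os.path.islink(old_path) or not os.path.isfile(old_path):
                continue  # skip symlinks, sockets, device nodes
            rel = os.path.relpath(old_path, old_root)
            new_path = os.path.join(new_root, rel)
            if not os.path.isfile(new_path):
                problems.append((rel, "missing"))
            elif not filecmp.cmp(old_path, new_path, shallow=False):
                # shallow=False forces a byte-by-byte comparison.
                problems.append((rel, "differs"))
            elif os.stat(old_path).st_mode != os.stat(new_path).st_mode:
                problems.append((rel, "mode"))
    return sorted(problems)
```

Calling `diff_trees("/mnt/old", "/")` would then list anything that changed in the transfer. Expect noise from files that legitimately changed since the copy, such as logs and caches.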

----------

## salahx

Actually, looking at the stack trace and the explanation of symptoms, this could be a genuine bug. It sounds like there is a race condition in reiserfs that's causing a deadlock. The screen saver is innocent in this matter - it just happens to widen the window in which the race can occur.

It may be worth recompiling the kernel with CONFIG_PROVE_LOCKING=y
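To confirm the rebuilt kernel really has the option set, something like the following sketch may help. It assumes the running kernel exposes its configuration at `/proc/config.gz` (which requires CONFIG_IKCONFIG_PROC); the function names are made up for illustration.

```python
import gzip


def option_enabled(config_text, option):
    """Return True if `option` is set to y in kernel config text."""
    wanted = option + "=y"
    return any(line.strip() == wanted for line in config_text.splitlines())


def running_kernel_has(option, path="/proc/config.gz"):
    """Check the running kernel's config for a built-in option."""
    # Assumes the kernel was built with CONFIG_IKCONFIG_PROC=y.
    with gzip.open(path, "rt") as fh:
        return option_enabled(fh.read(), option)
```

For example, `running_kernel_has("CONFIG_PROVE_LOCKING")` should return True after booting the new kernel.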

----------

## binro

 *srs5694 wrote:*   

> How did you transfer your system to the new disks? (dd, tar, etc.?) It could be there's a malfunction in the video drivers that's related to a subtle permission problem introduced in the transfer; or maybe a bit or two got flipped during the copying. If you've still got the original disk, you could plug it in and write a script to compare every file between the two systems.

 

The system is backed up using dar, which is a sound utility and checks the backup against the original disk every time.

----------

## binro

 *salahx wrote:*   

> Actually, looking at the stack trace and the explanation of symptoms, this could be a genuine bug. It sounds like there is a race condition in reiserfs that's causing a deadlock. The screen saver is innocent in this matter - it just happens to widen the window in which the race can occur.
> 
> It may be worth recompiling the kernel with CONFIG_PROVE_LOCKING=y

 

Thanks, I will try that.

----------

## binro

I am back looking at this again. The lock proving idea did not work because the kernel disabled it when the evil NVidia binary module tainted the kernel! I am now seeing this in the logging:

Feb 18 06:04:00 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 18 06:04:00 opal kernel: ata1.00: BMDMA stat 0x25

Feb 18 06:04:00 opal kernel: ata1.00: failed command: READ DMA EXT

Feb 18 06:04:00 opal kernel: ata1.00: cmd 25/00:18:f8:7b:57/00:00:93:00:00/e0 tag 0 dma 12288 in

Feb 18 06:04:00 opal kernel: res 51/40:00:f8:7b:57/40:00:93:00:00/00 Emask 0x9 (media error)

Feb 18 06:04:00 opal kernel: ata1.00: status: { DRDY ERR }

Feb 18 06:04:00 opal kernel: ata1.00: error: { UNC }

Feb 18 06:04:03 opal kernel: ata1.00: configured for UDMA/133

Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code

Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:03 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:03 opal kernel: Sense Key : Medium Error [current] [descriptor]

Feb 18 06:04:03 opal kernel: Descriptor sense data with sense descriptors (in hex):

Feb 18 06:04:03 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 

Feb 18 06:04:03 opal kernel: 93 57 7b f8 

Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:03 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed

Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda] CDB: 

Feb 18 06:04:03 opal kernel: Read(10): 28 00 93 57 7b f8 00 00 18 00

Feb 18 06:04:03 opal kernel: end_request: I/O error, dev sda, sector 2471984120

Feb 18 06:04:03 opal kernel: ata1: EH complete

Feb 18 06:04:03 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 18 06:04:03 opal kernel: ata1.00: BMDMA stat 0x25

Feb 18 06:04:03 opal kernel: ata1.00: failed command: READ DMA EXT

Feb 18 06:04:03 opal kernel: ata1.00: cmd 25/00:08:f8:7b:57/00:00:93:00:00/e0 tag 0 dma 4096 in

Feb 18 06:04:03 opal kernel: res 51/40:00:f8:7b:57/40:00:93:00:00/00 Emask 0x9 (media error)

Feb 18 06:04:03 opal kernel: ata1.00: status: { DRDY ERR }

Feb 18 06:04:03 opal kernel: ata1.00: error: { UNC }

Feb 18 06:04:03 opal kernel: ata1.00: configured for UDMA/133

Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code

Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:03 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:03 opal kernel: Sense Key : Medium Error [current] [descriptor]

Feb 18 06:04:03 opal kernel: Descriptor sense data with sense descriptors (in hex):

Feb 18 06:04:03 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 

Feb 18 06:04:03 opal kernel: 93 57 7b f8 

Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:03 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed

Feb 18 06:04:03 opal kernel: sd 0:0:0:0: [sda] CDB: 

Feb 18 06:04:03 opal kernel: Read(10): 28 00 93 57 7b f8 00 00 08 00

Feb 18 06:04:03 opal kernel: end_request: I/O error, dev sda, sector 2471984120

Feb 18 06:04:03 opal kernel: Buffer I/O error on device dm-3, logical block 9603455

Feb 18 06:04:03 opal kernel: ata1: EH complete

Feb 18 06:04:07 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 18 06:04:07 opal kernel: ata1.00: BMDMA stat 0x25

Feb 18 06:04:07 opal kernel: ata1.00: failed command: READ DMA EXT

Feb 18 06:04:07 opal kernel: ata1.00: cmd 25/00:10:20:7c:57/00:00:93:00:00/e0 tag 0 dma 8192 in

Feb 18 06:04:07 opal kernel: res 51/40:00:20:7c:57/40:00:93:00:00/00 Emask 0x9 (media error)

Feb 18 06:04:07 opal kernel: ata1.00: status: { DRDY ERR }

Feb 18 06:04:07 opal kernel: ata1.00: error: { UNC }

Feb 18 06:04:07 opal kernel: ata1.00: configured for UDMA/133

Feb 18 06:04:07 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code

Feb 18 06:04:07 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:07 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

Feb 18 06:04:07 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:07 opal kernel: Sense Key : Medium Error [current] [descriptor]

Feb 18 06:04:07 opal kernel: Descriptor sense data with sense descriptors (in hex):

Feb 18 06:04:07 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 

Feb 18 06:04:07 opal kernel: 93 57 7c 20 

Feb 18 06:04:07 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:07 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed

Feb 18 06:04:07 opal kernel: sd 0:0:0:0: [sda] CDB: 

Feb 18 06:04:07 opal kernel: Read(10): 28 00 93 57 7c 20 00 00 10 00

Feb 18 06:04:07 opal kernel: end_request: I/O error, dev sda, sector 2471984160

Feb 18 06:04:07 opal kernel: ata1: EH complete

Feb 18 06:04:10 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 18 06:04:10 opal kernel: ata1.00: BMDMA stat 0x25

Feb 18 06:04:10 opal kernel: ata1.00: failed command: READ DMA EXT

Feb 18 06:04:10 opal kernel: ata1.00: cmd 25/00:08:20:7c:57/00:00:93:00:00/e0 tag 0 dma 4096 in

Feb 18 06:04:10 opal kernel: res 51/40:00:20:7c:57/40:00:93:00:00/00 Emask 0x9 (media error)

Feb 18 06:04:10 opal kernel: ata1.00: status: { DRDY ERR }

Feb 18 06:04:10 opal kernel: ata1.00: error: { UNC }

Feb 18 06:04:10 opal kernel: ata1.00: configured for UDMA/133

Feb 18 06:04:10 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code

Feb 18 06:04:10 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:10 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

Feb 18 06:04:10 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:10 opal kernel: Sense Key : Medium Error [current] [descriptor]

Feb 18 06:04:10 opal kernel: Descriptor sense data with sense descriptors (in hex):

Feb 18 06:04:10 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 

Feb 18 06:04:10 opal kernel: 93 57 7c 20 

Feb 18 06:04:10 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:10 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed

Feb 18 06:04:10 opal kernel: sd 0:0:0:0: [sda] CDB: 

Feb 18 06:04:10 opal kernel: Read(10): 28 00 93 57 7c 20 00 00 08 00

Feb 18 06:04:10 opal kernel: end_request: I/O error, dev sda, sector 2471984160

Feb 18 06:04:10 opal kernel: Buffer I/O error on device dm-3, logical block 9603460

Feb 18 06:04:10 opal kernel: ata1: EH complete

Feb 18 06:04:13 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 18 06:04:13 opal kernel: ata1.00: BMDMA stat 0x25

Feb 18 06:04:13 opal kernel: ata1.00: failed command: READ DMA EXT

Feb 18 06:04:13 opal kernel: ata1.00: cmd 25/00:20:48:7c:57/00:00:93:00:00/e0 tag 0 dma 16384 in

Feb 18 06:04:13 opal kernel: res 51/40:00:48:7c:57/40:00:93:00:00/00 Emask 0x9 (media error)

Feb 18 06:04:13 opal kernel: ata1.00: status: { DRDY ERR }

Feb 18 06:04:13 opal kernel: ata1.00: error: { UNC }

Feb 18 06:04:13 opal kernel: ata1.00: configured for UDMA/133

Feb 18 06:04:13 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code

Feb 18 06:04:13 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:13 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

Feb 18 06:04:13 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:13 opal kernel: Sense Key : Medium Error [current] [descriptor]

Feb 18 06:04:13 opal kernel: Descriptor sense data with sense descriptors (in hex):

Feb 18 06:04:13 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 

Feb 18 06:04:13 opal kernel: 93 57 7c 48 

Feb 18 06:04:13 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:13 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed

Feb 18 06:04:13 opal kernel: sd 0:0:0:0: [sda] CDB: 

Feb 18 06:04:13 opal kernel: Read(10): 28 00 93 57 7c 48 00 00 20 00

Feb 18 06:04:13 opal kernel: end_request: I/O error, dev sda, sector 2471984200

Feb 18 06:04:13 opal kernel: ata1: EH complete

Feb 18 06:04:16 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 18 06:04:16 opal kernel: ata1.00: BMDMA stat 0x25

Feb 18 06:04:16 opal kernel: ata1.00: failed command: READ DMA EXT

Feb 18 06:04:16 opal kernel: ata1.00: cmd 25/00:08:48:7c:57/00:00:93:00:00/e0 tag 0 dma 4096 in

Feb 18 06:04:16 opal kernel: res 51/40:00:48:7c:57/40:00:93:00:00/00 Emask 0x9 (media error)

Feb 18 06:04:16 opal kernel: ata1.00: status: { DRDY ERR }

Feb 18 06:04:16 opal kernel: ata1.00: error: { UNC }

Feb 18 06:04:16 opal kernel: ata1.00: configured for UDMA/133

Feb 18 06:04:16 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code

Feb 18 06:04:16 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:16 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

Feb 18 06:04:16 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:16 opal kernel: Sense Key : Medium Error [current] [descriptor]

Feb 18 06:04:16 opal kernel: Descriptor sense data with sense descriptors (in hex):

Feb 18 06:04:16 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 

Feb 18 06:04:16 opal kernel: 93 57 7c 48 

Feb 18 06:04:16 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:16 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed

Feb 18 06:04:16 opal kernel: sd 0:0:0:0: [sda] CDB: 

Feb 18 06:04:16 opal kernel: Read(10): 28 00 93 57 7c 48 00 00 08 00

Feb 18 06:04:16 opal kernel: end_request: I/O error, dev sda, sector 2471984200

Feb 18 06:04:16 opal kernel: Buffer I/O error on device dm-3, logical block 9603465

Feb 18 06:04:16 opal kernel: ata1: EH complete

Feb 18 06:04:26 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 18 06:04:26 opal kernel: ata1.00: BMDMA stat 0x25

Feb 18 06:04:26 opal kernel: ata1.00: failed command: READ DMA EXT

Feb 18 06:04:26 opal kernel: ata1.00: cmd 25/00:08:60:7b:57/00:00:93:00:00/e0 tag 0 dma 4096 in

Feb 18 06:04:26 opal kernel: res 51/40:00:60:7b:57/40:00:93:00:00/00 Emask 0x9 (media error)

Feb 18 06:04:26 opal kernel: ata1.00: status: { DRDY ERR }

Feb 18 06:04:26 opal kernel: ata1.00: error: { UNC }

Feb 18 06:04:26 opal kernel: ata1.00: configured for UDMA/133

Feb 18 06:04:26 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code

Feb 18 06:04:26 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:26 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

Feb 18 06:04:26 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:26 opal kernel: Sense Key : Medium Error [current] [descriptor]

Feb 18 06:04:26 opal kernel: Descriptor sense data with sense descriptors (in hex):

Feb 18 06:04:26 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 

Feb 18 06:04:26 opal kernel: 93 57 7b 60 

Feb 18 06:04:26 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:26 opal kernel: Add. Sense: Unrecovered read error - auto reallocate failed

Feb 18 06:04:26 opal kernel: sd 0:0:0:0: [sda] CDB: 

Feb 18 06:04:26 opal kernel: Read(10): 28 00 93 57 7b 60 00 00 08 00

Feb 18 06:04:26 opal kernel: end_request: I/O error, dev sda, sector 2471983968

Feb 18 06:04:26 opal kernel: ata1: EH complete

Feb 18 06:04:29 opal kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 18 06:04:29 opal kernel: ata1.00: BMDMA stat 0x25

Feb 18 06:04:29 opal kernel: ata1.00: failed command: READ DMA EXT

Feb 18 06:04:29 opal kernel: ata1.00: cmd 25/00:08:60:7b:57/00:00:93:00:00/e0 tag 0 dma 4096 in

Feb 18 06:04:29 opal kernel: res 51/40:00:60:7b:57/40:00:93:00:00/00 Emask 0x9 (media error)

Feb 18 06:04:29 opal kernel: ata1.00: status: { DRDY ERR }

Feb 18 06:04:29 opal kernel: ata1.00: error: { UNC }

Feb 18 06:04:29 opal kernel: ata1.00: configured for UDMA/133

Feb 18 06:04:29 opal kernel: sd 0:0:0:0: [sda] Unhandled sense code

Feb 18 06:04:29 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:29 opal kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

Feb 18 06:04:29 opal kernel: sd 0:0:0:0: [sda]  

Feb 18 06:04:29 opal kernel: Sense Key : Medium Error [current] [descriptor]

Feb 18 06:04:29 opal kernel: Descriptor sense data with sense descriptors (in hex):

Feb 18 06:04:29 opal kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 

Feb 18 06:04:29 opal kernel: 93 57 7b 60 

This was during a nightly backup. Also...

Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, 112 Currently unreadable (pending) sectors

Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, 112 Offline uncorrectable sectors

Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 108

Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 57 to 60

Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, SMART Usage Attribute: 194 Temperature_Celsius changed from 43 to 40

Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, previous self-test completed with error (read test element)

Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, Self-Test Log error count increased from 2 to 3

Feb 18 17:24:08 opal smartd[10040]: Sending warning via mail to root@localhost ...

Feb 18 17:24:09 opal smartd[10040]: Warning via mail to root@localhost: successful

Feb 18 17:24:09 opal smartd[10040]: Device: /dev/sda, ATA error count increased from 107 to 123

Signs of a failing disk?
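For anyone reading the kernel log above: the `Read(10)` CDB lines carry the same sector numbers as the `end_request` lines, just in hex. A small sketch of the decoding, assuming the standard READ(10) layout (opcode 0x28, big-endian start LBA in bytes 2-5, block count in bytes 7-8); the function name is made up for illustration.

```python
def decode_read10(cdb_hex):
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: start LBA
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: number of blocks
    return lba, count


lba, count = decode_read10("28 00 93 57 7b f8 00 00 18 00")
print(lba, count, count * 512)  # prints: 2471984120 24 12288
```

The decoded LBA matches `sector 2471984120` in the log, and 24 blocks of 512 bytes match the `dma 12288 in` on the failed command line.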

----------

## Merlin-TC

I wouldn't say it's a sign of a failing disk; it is failing right now.

If there is anything important on it, copy it off while you can.

It also seems as if your hard drive doesn't have any spare sectors left, so you really should replace it.

This is a hardware error for sure.

It could of course be a faulty cable or SATA port, but I doubt it.

----------

## NeddySeagoon

binro,

the output of smartctl -a for that drive would be good.

```
Feb 18 17:24:08 opal smartd[10040]: Device: /dev/sda, SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 57 to 60
```

Cooling air at 60C over a disk.  I would be worried if mine went over 40C.
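A caveat when reading that smartd line: by default smartd reports changes in the normalized attribute value, not the raw one, and on this drive attribute 190 appears to be encoded as 100 minus the temperature (the smartctl output posted later in the thread shows VALUE 057 next to a raw value of 43). Under that assumption, "changed from 57 to 60" would mean the airflow temperature fell from 43C to 40C, which agrees with the attribute 194 line. A tiny sketch of the conversion, with a hypothetical function name:

```python
def airflow_temp_from_normalized(value):
    """Assumed encoding for attribute 190: normalized = 100 - degrees C."""
    return 100 - value


for normalized in (57, 60):
    print(normalized, "->", airflow_temp_from_normalized(normalized), "C")
```

The raw value in `smartctl -A` output is the safer number to quote when in doubt.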

----------

## binro

 #  smartctl -a /dev/sda

smartctl 5.42 2011-10-20 r3458 [x86_64-linux-3.7.7-gentoo] (local build)

Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===

Device Model:     ST2000DM001-9YN164

Serial Number:    S1E0MATD

LU WWN Device Id: 5 000c50 0517daeab

Firmware Version: CC4B

User Capacity:    2,000,398,934,016 bytes [2.00 TB]

Sector Sizes:     512 bytes logical, 4096 bytes physical

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:   8

ATA Standard is:  ATA-8-ACS revision 4

Local Time is:    Tue Feb 19 03:04:35 2013 ICT

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

                                        was completed without error.

                                        Auto Offline Data Collection: Enabled.

Self-test execution status:      ( 121) The previous self-test completed having

                                        the read element of the test failed.

Total time to complete Offline 

data collection:                (  575) seconds.

Offline data collection

capabilities:                    (0x7b) SMART execute Offline immediate.

                                        Auto Offline data collection on/off support.

                                        Suspend Offline collection upon new

                                        command.

                                        Offline surface scan supported.

                                        Self-test supported.

                                        Conveyance Self-test supported.

                                        Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

                                        power-saving mode.

                                        Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

                                        General Purpose Logging supported.

Short self-test routine 

recommended polling time:        (   1) minutes.

Extended self-test routine

recommended polling time:        ( 226) minutes.

Conveyance self-test routine

recommended polling time:        (   2) minutes.

SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000f   108   099   006    Pre-fail  Always       -       16533576

  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0

  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       32

  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0

  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       82984983

  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2028

 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0

 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       32

183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0

184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0

187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       123

188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0

189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0

190 Airflow_Temperature_Cel 0x0022   057   053   045    Old_age   Always       -       43 (Min/Max 35/44)

191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0

192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       23

193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       52

194 Temperature_Celsius     0x0022   043   047   000    Old_age   Always       -       43 (0 27 0 0 0)

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       112

198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       112

199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       247445950826471

241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       96929235318

242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       606478393055

SMART Error Log Version: 1

ATA Error Count: 123 (device log contains only the most recent five errors)

        CR = Command Register [HEX]

        FR = Features Register [HEX]

        SC = Sector Count Register [HEX]

        SN = Sector Number Register [HEX]

        CL = Cylinder Low Register [HEX]

        CH = Cylinder High Register [HEX]

        DH = Device/Head Register [HEX]

        DC = Device Command Register [HEX]

        ER = Error register [HEX]

        ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 123 occurred at disk power-on lifetime: 2007 hours (83 days + 15 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 ff ff ff ef 00   1d+11:21:48.160  READ DMA EXT

  c8 00 18 78 97 ff e9 00   1d+11:21:48.159  READ DMA

  c8 00 18 50 97 ff e9 00   1d+11:21:48.142  READ DMA

  25 00 10 ff ff ff ef 00   1d+11:21:48.142  READ DMA EXT

  25 00 08 ff ff ff ef 00   1d+11:21:48.138  READ DMA EXT

Error 122 occurred at disk power-on lifetime: 2007 hours (83 days + 15 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 10 ff ff ff ef 00   1d+11:21:45.114  READ DMA EXT

  35 00 80 ff ff ff ef 00   1d+11:21:45.113  WRITE DMA EXT

  35 00 10 ff ff ff ef 00   1d+11:21:45.113  WRITE DMA EXT

  35 00 08 ff ff ff ef 00   1d+11:21:45.113  WRITE DMA EXT

  35 00 08 ff ff ff ef 00   1d+11:21:45.113  WRITE DMA EXT

Error 121 occurred at disk power-on lifetime: 2007 hours (83 days + 15 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 ff ff ff ef 00   1d+11:21:41.561  READ DMA EXT

  c8 00 08 38 31 4f ea 00   1d+11:21:41.551  READ DMA

  c8 00 30 90 99 ff e9 00   1d+11:21:41.550  READ DMA

  c8 00 70 18 99 ff e9 00   1d+11:21:41.537  READ DMA

  25 00 08 ff ff ff ef 00   1d+11:21:41.526  READ DMA EXT

Error 120 occurred at disk power-on lifetime: 2007 hours (83 days + 15 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 10 ff ff ff ef 00   1d+11:21:38.611  READ DMA EXT

  ea 00 00 ff ff ff af 00   1d+11:21:38.581  FLUSH CACHE EXT

  35 00 08 ff ff ff ef 00   1d+11:21:38.581  WRITE DMA EXT

  25 00 08 ff ff ff ef 00   1d+11:21:38.566  READ DMA EXT

  ea 00 00 ff ff ff af 00   1d+11:21:38.533  FLUSH CACHE EXT

Error 119 occurred at disk power-on lifetime: 2007 hours (83 days + 15 hours)

  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 ff ff ff ef 00   1d+11:21:35.045  READ DMA EXT

  25 00 08 ff ff ff ef 00   1d+11:21:35.028  READ DMA EXT

  35 00 08 ff ff ff ef 00   1d+11:21:35.028  WRITE DMA EXT

  35 00 20 ff ff ff ef 00   1d+11:21:35.028  WRITE DMA EXT

  35 00 08 ff ff ff ef 00   1d+11:21:35.028  WRITE DMA EXT

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline       Completed: read failure       90%      2007         2471984032

# 2  Short offline       Completed: read failure       10%      1985         2471984032

# 3  Short offline       Completed without error       00%      1949         -

# 4  Short offline       Completed without error       00%      1925         -

# 5  Short offline       Completed without error       00%      1901         -

# 6  Short offline       Completed without error       00%      1877         -

# 7  Short offline       Completed without error       00%      1853         -

# 8  Short offline       Completed without error       00%      1829         -

# 9  Short offline       Completed without error       00%      1802         -

#10  Short offline       Completed without error       00%      1778         -

#11  Short offline       Completed without error       00%      1754         -

#12  Short offline       Completed without error       00%      1734         -

#13  Short offline       Completed without error       00%      1710         -

#14  Short offline       Completed without error       00%      1686         -

#15  Short offline       Completed without error       00%      1662         -

#16  Extended offline    Completed: read failure       40%      1644         2471983952

#17  Short offline       Completed without error       00%      1614         -

#18  Short offline       Completed without error       00%      1590         -

#19  Short offline       Completed without error       00%      1566         -

#20  Short offline       Completed without error       00%      1542         -

#21  Short offline       Completed without error       00%      1518         -

SMART Selective self-test log data structure revision number 1

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

I live in Bangkok, so 60C is not so hot in the middle of the night when the aircon is off. Kit does tend to expire more quickly out here, but this unit has only been operating 83 days! Well, Bangkok, as well as being hot, is also the hard disk capital of the world, so I should be able to get it replaced.  :Smile: 

----------

## NeddySeagoon

binro,

```
  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0 

197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 112 
```

The drive has 112 sectors that it would like to reallocate and none have been reallocated yet.

That you get hard errors shows that at least some sectors can no longer be read.
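That check can be automated along these lines. This is only a sketch: it assumes the ten-column attribute table printed by `smartctl -A`, as in the output above, and the function names are made up for illustration.

```python
def parse_smart_attributes(text):
    """Map attribute ID -> (name, leading integer of RAW_VALUE)."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header, blank line, or unrelated output
        raw = fields[9].split()[0]  # e.g. "43" from "43 (Min/Max 35/44)"
        if raw.isdigit():
            attrs[int(fields[0])] = (fields[1], int(raw))
    return attrs


def pending_but_not_reallocated(attrs):
    """True when sectors are pending (197) but none were reallocated (5)."""
    reallocated = attrs.get(5, ("", 0))[1]
    pending = attrs.get(197, ("", 0))[1]
    return pending > 0 and reallocated == 0
```

Fed the attribute table from this thread, `pending_but_not_reallocated` returns True: 112 pending sectors and a reallocated count of 0.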

The Seagate Website says 

```
    In Warranty  

Expiration 22-Sep-2013  
```

Don't mess about - save your data and return the drive.

----------

