# Intermittent I/O problems + growing UDMA_CRC_Error_Count

## bat0r

# uname -a 

Linux myserver 3.3.4-gentoo #1 SMP Wed May 9 22:58:59 MSK 2012 i686 Intel(R) Pentium(R) 4 CPU 2.80GHz GenuineIntel GNU/Linux 

From time to time, after the server starts, disk I/O problems appear.

While the problem is occurring:

 # hdparm -tT /dev/sda 

/dev/sda: 

 Timing cached reads: 2 MB in 5.68 seconds = 360.54 kB/sec 

 Timing buffered disk reads: 2 MB in 12.76 seconds = 160.50 kB/sec 

 /var/log/messages: 

Jun 24 19:46:32 localhost kernel: res 51/84:00:88:dc:51/00:00:00:00:00/e0 Emask 0x12 (ATA bus error) 

 Jun 24 19:46:32 localhost kernel: ata3.00: status: { DRDY ERR } 

 Jun 24 19:46:32 localhost kernel: ata3.00: error: { ICRC ABRT } 

 Jun 24 19:46:32 localhost kernel: ata3: hard resetting link 

 Jun 24 19:46:32 localhost kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310) 

 Jun 24 19:46:32 localhost kernel: ata3.00: configured for UDMA/33 

 Jun 24 19:46:32 localhost kernel: ata3: EH complete 

 Jun 24 19:46:32 localhost kernel: ata3.00: exception Emask 0x12 SAct 0x0 SErr 0x7a0601 action 0x6 

 Jun 24 19:46:32 localhost kernel: ata3.00: BMDMA stat 0x4 

 Jun 24 19:46:32 localhost kernel: ata3: SError: { RecovData Persist Proto PHYInt 10B8B Dispar BadCRC Handshk } 

 Jun 24 19:46:32 localhost kernel: ata3.00: failed command: READ DMA 

 Jun 24 19:46:32 localhost kernel: ata3.00: cmd c8/00:20:48:99:54/00:00:00:00:00/e0 tag 0 dma 16384 in 

 Jun 24 19:46:32 localhost kernel: res 51/84:00:48:99:54/00:00:00:00:00/e0 Emask 0x12 (ATA bus error) 

 Jun 24 19:46:32 localhost kernel: ata3.00: status: { DRDY ERR } 

 Jun 24 19:46:32 localhost kernel: ata3.00: error: { ICRC ABRT } 

 Jun 24 19:46:32 localhost kernel: ata3: hard resetting link 

 Jun 24 19:46:32 localhost kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310) 

 Jun 24 19:46:32 localhost kernel: ata3.00: configured for UDMA/33 

 Jun 24 19:46:32 localhost kernel: ata3: EH complete 

 Jun 24 19:46:32 localhost kernel: sis900.c: v1.08.10 Apr. 2 2006 

 Jun 24 19:46:32 localhost kernel: 0000:00:04.0: Realtek RTL8201 PHY transceiver found at address 1. 

 Jun 24 19:46:32 localhost kernel: 0000:00:04.0: Using transceiver found at address 1 as default 

 Jun 24 19:46:32 localhost kernel: eth0: SiS 900 PCI Fast Ethernet at 0xd800, IRQ 19, 00:15:f2:e9:19:7c 

 Jun 24 19:46:32 localhost udevd[659]: renamed network interface eth0 to eth1 

 Jun 24 19:46:32 localhost kernel: ata3.00: exception Emask 0x12 SAct 0x0 SErr 0x7a0601 action 0x6 

 Jun 24 19:46:32 localhost kernel: ata3.00: BMDMA stat 0x5 

 Jun 24 19:46:32 localhost kernel: ata3: SError: { RecovData Persist Proto PHYInt 10B8B Dispar BadCRC Handshk } 

 Jun 24 19:46:32 localhost kernel: ata3.00: failed command: READ DMA 

 Jun 24 19:46:32 localhost kernel: ata3.00: cmd c8/00:40:b8:dc:51/00:00:00:00:00/e0 tag 0 dma 32768 in 

 Jun 24 19:46:32 localhost kernel: res 51/84:2f:b8:dc:51/00:00:00:00:00/e0 Emask 0x12 (ATA bus error) 

 Jun 24 19:46:32 localhost kernel: ata3.00: status: { DRDY ERR } 

 Jun 24 19:46:32 localhost kernel: ata3.00: error: { ICRC ABRT } 

 Jun 24 19:46:32 localhost kernel: ata3: hard resetting link 

 Jun 24 19:46:32 localhost kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310) 

 Jun 24 19:46:32 localhost kernel: ata3.00: configured for UDMA/33 

 Jun 24 19:46:32 localhost kernel: ata3: EH complete 

 Jun 24 19:46:32 localhost kernel: 3c59x: Donald Becker and others. 

 Jun 24 19:46:32 localhost kernel: 0000:00:09.0: 3Com PCI 3c905C Tornado at f87cc800. 

 Jun 24 19:46:32 localhost kernel: ata3.00: exception Emask 0x12 SAct 0x0 SErr 0x7a0601 action 0x6 

 Jun 24 19:46:32 localhost kernel: ata3.00: BMDMA stat 0x4 

 Jun 24 19:46:32 localhost kernel: ata3: SError: { RecovData Persist Proto PHYInt 10B8B Dispar BadCRC Handshk } 

 Jun 24 19:46:32 localhost kernel: ata3.00: failed command: READ DMA 

 Jun 24 19:46:32 localhost kernel: ata3.00: cmd c8/00:08:00:98:81/00:00:00:00:00/e0 tag 0 dma 4096 in 

 Jun 24 19:46:32 localhost kernel: res 51/84:00:00:98:81/00:00:00:00:00/e0 Emask 0x12 (ATA bus error) 

 Jun 24 19:46:32 localhost kernel: ata3.00: status: { DRDY ERR } 

 Jun 24 19:46:32 localhost kernel: ata3.00: error: { ICRC ABRT } 

 Jun 24 19:46:32 localhost kernel: ata3: hard resetting link 

 Jun 24 19:46:32 localhost kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310) 

 Jun 24 19:46:32 localhost kernel: ata3.00: configured for UDMA/33 

 Jun 24 19:46:32 localhost kernel: ata3: EH complete 

 ...... 
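For reference, the SError value in these messages can be decoded by hand. The sketch below assumes the bit-to-label mapping that libata uses when it prints the `SError: { ... }` line, and that SStatus/SControl are printed in hex and split into DET/SPD/IPM nibbles. The function names are made up for illustration.

```python
# Assumed mapping of SError register bits to the labels libata prints.
SERR_BITS = {
    0: "RecovData", 1: "RecovComm", 8: "UnrecovData", 9: "Persist",
    10: "Proto", 11: "HostInt", 16: "PHYRdyChg", 17: "PHYInt",
    18: "CommWake", 19: "10B8B", 20: "Dispar", 21: "BadCRC",
    22: "Handshk", 23: "LinkSeq", 24: "TrStaTrns", 25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the labels of all bits set in an SError value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


def decode_link_reg(value):
    """Split an SStatus or SControl value into its DET, SPD and IPM nibbles."""
    return {"DET": value & 0xF, "SPD": (value >> 4) & 0xF, "IPM": (value >> 8) & 0xF}


print(decode_serror(0x7A0601))   # value from "SErr 0x7a0601" in the log
print(decode_link_reg(0x113))    # "SStatus 113", assumed to be hex
print(decode_link_reg(0x310))    # "SControl 310", assumed to be hex
```

For 0x7a0601 this gives the same eight labels the kernel printed, and SStatus 113 reads as DET=3, SPD=1, IPM=1, which is consistent with the link coming up at 1.5 Gbps.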

 After a reboot everything is OK.

 # hdparm -tT /dev/sda 

 /dev/sda: 

 Timing cached reads: 704 MB in 2.00 seconds = 351.22 MB/sec 

 Timing buffered disk reads: 286 MB in 3.01 seconds = 95.00 MB/sec 
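To put a number on the difference between the two `hdparm -tT` runs, here is a rough sketch that recomputes the throughput from the "N MB in T seconds" part of each line (so the kB/sec vs MB/sec units do not matter) and prints the slowdown. The function and variable names are only for illustration.

```python
import re

TIMING_RE = re.compile(
    r"Timing (?P<kind>cached reads|buffered disk reads):\s+"
    r"(?P<mb>\d+) MB in\s+(?P<secs>[\d.]+) seconds"
)


def parse_hdparm(text):
    """Return {test kind: MB/s}, computed as megabytes divided by seconds."""
    rates = {}
    for m in TIMING_RE.finditer(text):
        rates[m.group("kind")] = int(m.group("mb")) / float(m.group("secs"))
    return rates


# Output copied from the two runs in this post.
DEGRADED = """
 Timing cached reads: 2 MB in 5.68 seconds = 360.54 kB/sec
 Timing buffered disk reads: 2 MB in 12.76 seconds = 160.50 kB/sec
"""
HEALTHY = """
 Timing cached reads: 704 MB in 2.00 seconds = 351.22 MB/sec
 Timing buffered disk reads: 286 MB in 3.01 seconds = 95.00 MB/sec
"""

degraded = parse_hdparm(DEGRADED)
healthy = parse_hdparm(HEALTHY)
for kind in healthy:
    slowdown = healthy[kind] / degraded[kind]
    print(f"{kind}: {degraded[kind]:.2f} MB/s vs {healthy[kind]:.2f} MB/s "
          f"(about {slowdown:.0f}x slower)")
```

Note that the cached-read figure collapses as well, not only the buffered disk reads.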

 After a few reboots the problem comes back right at server startup: the server takes about 10 minutes to boot, and after that there are severe I/O delays and performance drops sharply. The problem does not depend on whether I start the server manually or via WoL.

I replaced the SATA cable for the /dev/sda disk, but the problem remained.

Moreover, the same I/O degradation also shows up on the PATA disk (/dev/sdb).

The SATA and PATA controllers are both SiS.

 # lspci|grep IDE 

 00:02.5 IDE interface: Silicon Integrated Systems [SiS] 5513 IDE Controller (rev 01) 

 00:05.0 IDE interface: Silicon Integrated Systems [SiS] SATA (rev 01) 
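To check which of these two controllers (and which ataN port) each disk actually hangs off, the resolved sysfs path of the block device can be inspected. This is only a sketch: it assumes the usual libata sysfs layout in which the path contains the PCI address and an `ataN` component, and the sample path in the comment is hypothetical, not taken from this machine. sysfs prefixes the lspci address with the PCI domain, so `00:05.0` appears as `0000:00:05.0`.

```python
import os
import re

PCI_ADDR_RE = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]")
ATA_PORT_RE = re.compile(r"/(ata\d+)/")


def controller_of(sysfs_path):
    """Return (pci_address, ata_port) found in a resolved /sys/block/<disk> path.

    Either element is None when the path does not contain it.
    """
    pci = PCI_ADDR_RE.findall(sysfs_path)
    port = ATA_PORT_RE.search(sysfs_path)
    return (pci[-1] if pci else None, port.group(1) if port else None)


def resolve(disk):
    """Follow the /sys/block/<disk> symlink to the real device path."""
    return os.path.realpath(os.path.join("/sys/block", disk))


# Hypothetical example of what a resolved path might look like:
#   /sys/devices/pci0000:00/0000:00:05.0/ata3/host2/target2:0:0/2:0:0:0/block/sda
if __name__ == "__main__":
    for disk in ("sda", "sdb"):
        print(disk, controller_of(resolve(disk)))
```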

 # smartctl -i /dev/sda 

 smartctl 5.42 2011-10-20 r3458 [i686-linux-3.3.4-gentoo] (local build) 

 Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

 === START OF INFORMATION SECTION === 

 Model Family: Western Digital Caviar Green (Adv. Format) 

 Device Model: WDC WD20EARS-00MVWB0 

 LU WWN Device Id: 5 0014ee 25b018e83 

 Firmware Version: 51.0AB51 

 User Capacity: 2 000 394 706 432 bytes [2,00 TB] 

 Sector Size: 512 bytes logical/physical 

 Device is: In smartctl database [for details use: -P show] 

 ATA Version is: 8 

 ATA Standard is: Exact ATA specification draft version not indicated 

 Local Time is: Sun Jun 24 20:01:33 2012 MSK 

 SMART support is: Available - device has SMART capability. 

 SMART support is: Enabled 

 # smartctl -i /dev/sdb 

 smartctl 5.42 2011-10-20 r3458 [i686-linux-3.3.4-gentoo] (local build) 

 Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

 === START OF INFORMATION SECTION === 

 Model Family: Hitachi Deskstar 7K250 

 Device Model: HDS722525VLAT80 

 Firmware Version: V36OA6MA 

 User Capacity: 250 058 268 160 bytes [250 GB] 

 Sector Size: 512 bytes logical/physical 

 Device is: In smartctl database [for details use: -P show] 

 ATA Version is: 6 

 ATA Standard is: ATA/ATAPI-6 T13 1410D revision 3a 

 Local Time is: Sun Jun 24 20:02:15 2012 MSK 

 SMART support is: Available - device has SMART capability. 

 SMART support is: Enabled 

The SATA disk is a WDC Caviar Green, famous for its "intelligent" green habit of frequently parking the heads to cut power consumption and improve our planet's climate. Its parking timeout is 8 seconds, the WD default (the "idle3" timeout value).

"This timeout controls how often the drive parks its heads and enters a low power consumption state. "

I changed this timeout to 300 seconds, but the problem remained.

 # hdparm -J /dev/sda 

 /dev/sda: 

 wdidle3 = 300 secs (or 13.8 secs for older drives)

 myserver ~ # smartctl -A /dev/sdb 

 smartctl 5.42 2011-10-20 r3458 [i686-linux-3.3.4-gentoo] (local build) 

 Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

 === START OF READ SMART DATA SECTION === 

 SMART Attributes Data Structure revision number: 16 

 Vendor Specific SMART Attributes with Thresholds: 

 ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 

 1 Raw_Read_Error_Rate 0x000b 100 100 060 Pre-fail Always - 0 

 2 Throughput_Performance 0x0005 100 100 050 Pre-fail Offline - 0 

 3 Spin_Up_Time 0x0007 096 096 024 Pre-fail Always - 347 (Average 377) 

 4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 448 

 5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0 

 7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0 

 8 Seek_Time_Performance 0x0005 100 100 020 Pre-fail Offline - 0 

 9 Power_On_Hours 0x0012 099 099 000 Old_age Always - 13011 

 10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0 

 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 448 

 192 Power-Off_Retract_Count 0x0032 100 100 050 Old_age Always - 998 

 193 Load_Cycle_Count 0x0012 100 100 050 Old_age Always - 998 

 194 Temperature_Celsius 0x0002 130 130 000 Old_age Always - 42 (Min/Max 16/5 

 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0 

 197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0 

 198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0 

199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 84588 

 myserver ~ # smartctl -A /dev/sda 

smartctl 5.42 2011-10-20 r3458 [i686-linux-3.3.4-gentoo] (local build) 

 Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

 === START OF READ SMART DATA SECTION === 

 SMART Attributes Data Structure revision number: 16 

 Vendor Specific SMART Attributes with Thresholds: 

 ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 

 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0 

 3 Spin_Up_Time 0x0027 253 186 021 Pre-fail Always - 1025 

 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 53 

 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0 

 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0 

 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 255 

 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0 

 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0 

 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 46 

 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 28 

 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2576 

 194 Temperature_Celsius 0x0022 112 101 000 Old_age Always - 38 

 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0 

 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0 

 198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0 

199 UDMA_CRC_Error_Count 0x0032 200 155 000 Old_age Always - 4811 

 200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
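To compare the two disks, the raw values can be pulled out of the `smartctl -A` tables and divided by Power_On_Hours. This is a rough sketch with made-up helper names; it assumes the ten-column attribute layout shown above and only gives a lifetime average, so bursts of errors are smoothed out.

```python
def parse_smart_attributes(text):
    """Map attribute name -> raw value (first integer of the RAW_VALUE column)."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue
    return attrs


def per_hour(attrs, name):
    """Average count of an attribute per power-on hour."""
    return attrs[name] / attrs["Power_On_Hours"]


# Rows copied from the smartctl -A output in this post.
SDA = """
 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 255
 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2576
 199 UDMA_CRC_Error_Count 0x0032 200 155 000 Old_age Always - 4811
"""
SDB = """
 9 Power_On_Hours 0x0012 099 099 000 Old_age Always - 13011
 193 Load_Cycle_Count 0x0012 100 100 050 Old_age Always - 998
 199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 84588
"""

for disk, text in (("sda", SDA), ("sdb", SDB)):
    attrs = parse_smart_attributes(text)
    print(f"{disk}: {per_hour(attrs, 'UDMA_CRC_Error_Count'):.1f} CRC errors/hour, "
          f"{per_hour(attrs, 'Load_Cycle_Count'):.2f} load cycles/hour")
```

Running the same parser on two `smartctl -A` dumps taken some time apart and subtracting the UDMA_CRC_Error_Count values would show whether the counter is still growing.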

Everywhere I read that this problem is caused either by the transport between the disk's controller and the controller on the motherboard (or expansion card), or by a buggy driver.

The first explanation does not seem to hold, since the cable was replaced.

The kernel is configured with:

 CONFIG_SATA_SIS: 

 x This option enables support for SiS Serial ATA on x 

 x SiS 964/965/966/180 and Parallel ATA on SiS 180. x 

 x The PATA support for SiS 180 requires additionally to x 

 x enable the PATA_SIS driver in the config. x 

 x If unsure, say N. x 

 x x 

 x Symbol: SATA_SIS [=y] 

 CONFIG_PATA_SIS: x 

 x x 

 x This option enables support for SiS PATA controllers x 

 x x 

 x If unsure, say N. x 

 x x 

 x Symbol: PATA_SIS [=y] x 

 x Type : tristate x 

 x Prompt: SiS PATA support x 

 x Defined at drivers/ata/Kconfig:663 x 

 x Depends on: ATA [=y] && ATA_SFF [=y] && ATA_BMDMA [=y] && PCI [=y] x 

 x Location: x 

 x -> Device Drivers x 

 x -> Serial ATA and Parallel ATA drivers (ATA [=y]) x 

 x -> ATA SFF support (ATA_SFF [=y]) x 

 x -> ATA BMDMA support (ATA_BMDMA [=y]) x 

 x Selected by: SATA_SIS [=y] && ATA [=y] && ATA_SFF [=y] && ATA_BMDMA [=y] && PCI [=y]

----------

## burik666

I had a similar problem and solved it by moving the HDD to a different SATA port. If the disk itself is fine, the SATA controller may have died.

----------

