# How did I fry my BTRFS? parent transid verify failed

## andig

This past December, I created a RAID6 (mkfs.btrfs -f -d raid6 -m raid6 -L slowPool /dev/sd[cdefgh]) from 6 WD Red 5TB disks and copied a lot of data onto it (mostly TV recordings, ripped DVDs, etc.).

This weekend I noticed that a folder I believed I had deleted is still there, and I cannot access it. In fact, when I access it, btrfs floods the kernel log and remounts the file system read-only. For example, calling du results in this: "du: Zugriff auf './neu' nicht möglich: Veraltete Dateizugriffsnummer (file handle)" (in English: "du: cannot access './neu': Stale file handle"). The messages in the kernel log are similar to these:

```
Mär 08 00:34:01 gallifrey kernel: BTRFS error (device sdd): parent transid verify failed on 4296097792 wanted 8364 found 8359

Mär 08 00:34:01 gallifrey kernel: BTRFS error (device sdd): parent transid verify failed on 4296097792 wanted 8364 found 8359

Mär 08 00:34:01 gallifrey kernel: BTRFS error (device sdd): parent transid verify failed on 4296097792 wanted 8364 found 8359

Mär 08 00:34:01 gallifrey kernel: ------------[ cut here ]------------

Mär 08 00:34:01 gallifrey kernel: WARNING: CPU: 2 PID: 1293 at fs/btrfs/extent-tree.c:6549 __btrfs_free_extent.isra.35+0x2e6/0xd30 [btrfs]()

Mär 08 00:34:01 gallifrey kernel: BTRFS: Transaction aborted (error -5)

Mär 08 00:34:01 gallifrey kernel: Modules linked in: uas usb_storage xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf

Mär 08 00:34:01 gallifrey kernel:  ie31200_edac ipmi_si serio edac_core fjes ipmi_msghandler int3402_thermal video processor_thermal_device intel_soc_dts_iosf i

Mär 08 00:34:01 gallifrey kernel: CPU: 2 PID: 1293 Comm: btrfs-transacti Not tainted 4.4.4-gentoo #1

Mär 08 00:34:01 gallifrey kernel: Hardware name: Intel Corporation S1200RP_SE/S1200RP_SE, BIOS S1200RP.86B.02.02.0005.102320140911 10/23/2014

Mär 08 00:34:01 gallifrey kernel:  0000000000000286 000000006a84355a ffff880813adfa10 ffffffff812cb161

Mär 08 00:34:01 gallifrey kernel:  ffff880813adfa58 ffffffffa07714f4 ffff880813adfa48 ffffffff81077712

Mär 08 00:34:01 gallifrey kernel:  00000000fff40000 00000000fffffffb 0000000000000000 ffff88082b5a4000

Mär 08 00:34:01 gallifrey kernel: Call Trace:

Mär 08 00:34:01 gallifrey kernel:  [<ffffffff812cb161>] dump_stack+0x63/0x82

Mär 08 00:34:01 gallifrey kernel:  [<ffffffff81077712>] warn_slowpath_common+0x82/0xc0

Mär 08 00:34:01 gallifrey kernel:  [<ffffffff810777ac>] warn_slowpath_fmt+0x5c/0x80

Mär 08 00:34:01 gallifrey kernel:  [<ffffffffa06d3036>] __btrfs_free_extent.isra.35+0x2e6/0xd30 [btrfs]

Mär 08 00:34:01 gallifrey kernel:  [<ffffffffa073c27c>] ? btrfs_merge_delayed_refs+0x8c/0x5b0 [btrfs]

Mär 08 00:34:01 gallifrey kernel:  [<ffffffffa06d7718>] __btrfs_run_delayed_refs+0xac8/0x1260 [btrfs]

Mär 08 00:34:01 gallifrey kernel:  [<ffffffffa06dac62>] btrfs_run_delayed_refs+0x82/0x290 [btrfs]

Mär 08 00:34:01 gallifrey kernel:  [<ffffffffa06dbb3f>] btrfs_write_dirty_block_groups+0xff/0x2b0 [btrfs]

Mär 08 00:34:01 gallifrey kernel:  [<ffffffffa0768f11>] commit_cowonly_roots+0x218/0x2c2 [btrfs]

Mär 08 00:34:01 gallifrey kernel:  [<ffffffffa06f0192>] btrfs_commit_transaction+0x512/0xb40 [btrfs]

Mär 08 00:34:01 gallifrey kernel:  [<ffffffffa06f085e>] ? start_transaction+0x9e/0x4e0 [btrfs]

Mär 08 00:34:01 gallifrey kernel:  [<ffffffffa06eb353>] transaction_kthread+0x1c3/0x230 [btrfs]

Mär 08 00:34:01 gallifrey kernel:  [<ffffffffa06eb190>] ? btrfs_cleanup_transaction+0x570/0x570 [btrfs]

Mär 08 00:34:01 gallifrey kernel:  [<ffffffff810953e8>] kthread+0xd8/0xf0

Mär 08 00:34:01 gallifrey kernel:  [<ffffffff81095310>] ? kthread_worker_fn+0x170/0x170

Mär 08 00:34:01 gallifrey kernel:  [<ffffffff8159718f>] ret_from_fork+0x3f/0x70

Mär 08 00:34:01 gallifrey kernel:  [<ffffffff81095310>] ? kthread_worker_fn+0x170/0x170

Mär 08 00:34:01 gallifrey kernel: ---[ end trace 7680c036b5c321b6 ]---

Mär 08 00:34:01 gallifrey kernel: BTRFS: error (device sdd) in __btrfs_free_extent:6549: errno=-5 IO failure

Mär 08 00:34:01 gallifrey kernel: BTRFS info (device sdd): forced readonly

Mär 08 00:34:01 gallifrey kernel: BTRFS: error (device sdd) in btrfs_run_delayed_refs:2927: errno=-5 IO failure

Mär 08 00:34:01 gallifrey kernel: BTRFS warning (device sdd): Skipping commit of aborted transaction.

Mär 08 00:34:01 gallifrey kernel: BTRFS: error (device sdd) in cleanup_transaction:1746: errno=-5 IO failure
```

As a quick check, I ran "du -hs" over the whole filesystem and got a couple more of these errors (stale file handle). I can create and read files perfectly normally, but when I access one of the affected files, I get messages like the ones above.

Googling for "parent transid verify failed" led me to this article. The problem is that I do not have a single disk large enough to run a "btrfs restore". "btrfs rescue zero-log /dev/sdc" did not make any difference. (Later, I read that zero-log was actually intended to work around a bug that was fixed long ago, so this route seems like a dead end.)

"btrfs rescue chunk-recover -v /dev/sdc" then produced several csum errors in the kernel log:

```
Mär 08 00:52:27 gallifrey kernel: BTRFS error (device sdd): parent transid verify failed on 1416527872 wanted 8366 found 8364

Mär 08 00:52:27 gallifrey kernel: BTRFS error (device sdd): parent transid verify failed on 1416527872 wanted 8366 found 8364

Mär 08 00:52:27 gallifrey kernel: BTRFS error (device sdd): parent transid verify failed on 1416527872 wanted 8366 found 8364

Mär 08 00:52:27 gallifrey kernel: BTRFS info (device sdd): no csum found for inode 68434 start 0

Mär 08 00:52:27 gallifrey kernel: BTRFS error (device sdd): parent transid verify failed on 1416527872 wanted 8366 found 8364

Mär 08 00:52:27 gallifrey kernel: BTRFS error (device sdd): parent transid verify failed on 1416527872 wanted 8366 found 8364

Mär 08 00:52:27 gallifrey kernel: BTRFS error (device sdd): parent transid verify failed on 1416527872 wanted 8366 found 8364

Mär 08 00:52:27 gallifrey kernel: BTRFS info (device sdd): no csum found for inode 68434 start 4096

Mär 08 00:52:27 gallifrey kernel: BTRFS error (device sdd): parent transid verify failed on 1416527872 wanted 8366 found 8364

Mär 08 00:52:27 gallifrey kernel: BTRFS error (device sdd): parent transid verify failed on 1416527872 wanted 8366 found 8364

Mär 08 00:52:27 gallifrey kernel: BTRFS error (device sdd): parent transid verify failed on 1416527872 wanted 8366 found 8364

Mär 08 00:52:27 gallifrey kernel: BTRFS info (device sdd): no csum found for inode 68434 start 8192

Mär 08 00:52:27 gallifrey kernel: BTRFS error (device sdd): parent transid verify failed on 1416527872 wanted 8366 found 8364

Mär 08 00:52:27 gallifrey kernel: BTRFS info (device sdd): no csum found for inode 68434 start 12288

Mär 08 00:52:27 gallifrey kernel: BTRFS warning (device sdd): csum failed ino 68434 off 0 csum 2565112818 expected csum 0

Mär 08 00:52:27 gallifrey kernel: BTRFS warning (device sdd): csum failed ino 68434 off 4096 csum 3654613761 expected csum 0

Mär 08 00:52:27 gallifrey kernel: BTRFS warning (device sdd): csum failed ino 68434 off 8192 csum 2603610151 expected csum 0

Mär 08 00:52:27 gallifrey kernel: BTRFS warning (device sdd): csum failed ino 68434 off 12288 csum 1331926123 expected csum 0

Mär 08 00:52:27 gallifrey kernel: BTRFS info (device sdd): no csum found for inode 68435 start 0

Mär 08 00:52:27 gallifrey kernel: BTRFS info (device sdd): no csum found for inode 68435 start 4096

Mär 08 00:52:27 gallifrey kernel: BTRFS info (device sdd): no csum found for inode 68435 start 8192

Mär 08 00:52:27 gallifrey kernel: BTRFS warning (device sdd): csum failed ino 68435 off 0 csum 2736094510 expected csum 0

Mär 08 00:52:27 gallifrey kernel: BTRFS warning (device sdd): csum failed ino 68435 off 4096 csum 31092837 expected csum 0

Mär 08 00:52:27 gallifrey kernel: BTRFS warning (device sdd): csum failed ino 68435 off 8192 csum 319331913 expected csum 0
```

Overnight, the chunk-recover process failed and exited.
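Editor's sketch: the transid messages above repeat many times, so it can help to tally them per block address and see how far "found" lags behind "wanted". The Python below is a minimal illustration, not a btrfs tool; the function name `summarize_transid_errors` and the `SAMPLE_LOG` constant are hypothetical, and the regular expression assumes exactly the message format shown in the logs in this post.

```python
import re
from collections import Counter

# Matches the kernel message format shown in the logs above.
TRANSID_RE = re.compile(
    r"BTRFS error \(device (?P<dev>\w+)\): parent transid verify failed "
    r"on (?P<bytenr>\d+) wanted (?P<wanted>\d+) found (?P<found>\d+)"
)


def summarize_transid_errors(log_text):
    """Count messages per (device, block address, wanted, found) tuple."""
    counts = Counter()
    for match in TRANSID_RE.finditer(log_text):
        key = (
            match["dev"],
            int(match["bytenr"]),
            int(match["wanted"]),
            int(match["found"]),
        )
        counts[key] += 1
    return counts


# Three lines copied from the logs in this post.
SAMPLE_LOG = """\
Mär 08 00:34:01 gallifrey kernel: BTRFS error (device sdd): parent transid verify failed on 4296097792 wanted 8364 found 8359
Mär 08 00:52:27 gallifrey kernel: BTRFS error (device sdd): parent transid verify failed on 1416527872 wanted 8366 found 8364
Mär 08 00:52:27 gallifrey kernel: BTRFS error (device sdd): parent transid verify failed on 1416527872 wanted 8366 found 8364
"""

if __name__ == "__main__":
    summary = summarize_transid_errors(SAMPLE_LOG)
    for (dev, bytenr, wanted, found), seen in sorted(summary.items()):
        print(
            f"{dev} block {bytenr}: wanted {wanted}, found {found} "
            f"(gap {wanted - found}), seen {seen}x"
        )
```

To use it on a real log, the text of `journalctl -k` could be passed to `summarize_transid_errors` instead of `SAMPLE_LOG`.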

Then I tried "btrfs scrub", which also failed:

```
Mär 09 00:12:19 gallifrey systemd[1]: Started Btrfs scrub on /mnt/slowPool.

Mär 09 00:12:23 gallifrey btrfs[19067]: ERROR: scrubbing /mnt/slowPool failed for device id 1: ret=-1, errno=5 (Input/output error)

Mär 09 00:12:23 gallifrey btrfs[19067]: ERROR: scrubbing /mnt/slowPool failed for device id 2: ret=-1, errno=5 (Input/output error)

Mär 09 00:12:23 gallifrey btrfs[19067]: ERROR: scrubbing /mnt/slowPool failed for device id 3: ret=-1, errno=5 (Input/output error)

Mär 09 00:12:23 gallifrey btrfs[19067]: ERROR: scrubbing /mnt/slowPool failed for device id 4: ret=-1, errno=5 (Input/output error)

Mär 09 00:12:23 gallifrey btrfs[19067]: ERROR: scrubbing /mnt/slowPool failed for device id 5: ret=-1, errno=5 (Input/output error)

Mär 09 00:12:23 gallifrey btrfs[19067]: ERROR: scrubbing /mnt/slowPool failed for device id 6: ret=-1, errno=5 (Input/output error)

Mär 09 00:12:23 gallifrey btrfs[19067]: scrub canceled for 12b65c7e-2148-4ab5-a1bd-389fabaf4e79

Mär 09 00:12:23 gallifrey btrfs[19067]:         scrub started at Wed Mar  9 00:12:19 2016 and was aborted after 00:00:04

Mär 09 00:12:23 gallifrey btrfs[19067]:         total bytes scrubbed: 334.44MiB with 48 errors

Mär 09 00:12:23 gallifrey btrfs[19067]:         error details: read=48

Mär 09 00:12:23 gallifrey btrfs[19067]:         corrected errors: 0, uncorrectable errors: 48, unverified errors: 0

Mär 09 00:12:23 gallifrey systemd[1]: btrfs-scrub@mnt-slowPool.service: Main process exited, code=exited, status=1/FAILURE

Mär 09 00:12:23 gallifrey systemd[1]: btrfs-scrub@mnt-slowPool.service: Unit entered failed state.

Mär 09 00:12:23 gallifrey systemd[1]: btrfs-scrub@mnt-slowPool.service: Failed with result 'exit-code'.
```

What caught my attention are those uncorrectable errors. I mean, isn't RAID6 there precisely so that these do not happen? I am currently running a non-destructive "badblocks -nsv -b 4096 /dev/sdc" on each of the 6 devices, and none has shown any error yet. The SMART data of all devices look similar (3 of the HDDs are one year old, the other three are from December):

```
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.4.4-gentoo] (local build)

Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===

Model Family:     Western Digital Red

Device Model:     WDC WD50EFRX-68MYMN1

Serial Number:    WD-WX11D4431746

LU WWN Device Id: 5 0014ee 20aa7bb61

Firmware Version: 82.00A82

User Capacity:    5.000.981.078.016 bytes [5,00 TB]

Sector Sizes:     512 bytes logical, 4096 bytes physical

Rotation Rate:    5700 rpm

Device is:        In smartctl database [for details use: -P show]

ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b

SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)

Local Time is:    Thu Mar 10 19:25:19 2016 CET

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status:  (0x00) Offline data collection activity

                                        was never started.

                                        Auto Offline Data Collection: Disabled.

Self-test execution status:      (  41) The self-test routine was interrupted

                                        by the host with a hard or soft reset.

Total time to complete Offline 

data collection:                (55440) seconds.

Offline data collection

capabilities:                    (0x7b) SMART execute Offline immediate.

                                        Auto Offline data collection on/off support.

                                        Suspend Offline collection upon new

                                        command.

                                        Offline surface scan supported.

                                        Self-test supported.

                                        Conveyance Self-test supported.

                                        Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

                                        power-saving mode.

                                        Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

                                        General Purpose Logging supported.

Short self-test routine 

recommended polling time:        (   2) minutes.

Extended self-test routine

recommended polling time:        ( 554) minutes.

Conveyance self-test routine

recommended polling time:        (   5) minutes.

SCT capabilities:              (0x303d) SCT Status supported.

                                        SCT Error Recovery Control supported.

                                        SCT Feature Control supported.

                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0

  3 Spin_Up_Time            0x0027   195   192   021    Pre-fail  Always       -       9225

  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       631

  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0

  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0

  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       9021

 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0

 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0

 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       230

192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       44

193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3359

194 Temperature_Celsius     0x0022   103   100   000    Old_age   Always       -       49

196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0

198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       2

200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1

No Errors Logged

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Extended captive    Interrupted (host reset)      90%      8993         -

# 2  Short offline       Completed without error       00%      8993         -

SMART Selective self-test log data structure revision number 1

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

```

The interrupted extended captive self-test happened on all devices... From what I have read, this could be caused by power saving, since all devices go to sleep after 20 minutes. I will verify this with power management turned off once badblocks has finished. (Read: tomorrow... Damn big and slow HDDs...)

Two of the disks are connected to a PCIe SATA controller (with ASMEDIA ASM1061). These show "UDMA_CRC_Error_Count 2" but this low number should be nothing to worry about, right?
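Editor's sketch: comparing the SMART tables of six disks by eye is tedious, so the attribute rows can be pulled out programmatically. The Python below parses rows in the `smartctl` attribute table format shown above and reports watched attributes with a non-zero raw value. The function names and the choice of watched attributes are assumptions made for illustration; `UDMA_CRC_Error_Count` is included because it counts transfer errors on the link between drive and controller rather than errors on the platters.

```python
# Attributes picked for illustration; this selection is an assumption.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def parse_smart_attributes(text):
    """Return {attribute name: raw value string} from a smartctl table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns: ID, name, flag (0x....), ..., raw.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attrs[fields[1]] = fields[9]
    return attrs


def nonzero_watched(attrs):
    """Keep only watched attributes whose raw value is not "0"."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, "0") != "0"}


# Rows copied from the smartctl output in this post.
SAMPLE_TABLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       2
"""

if __name__ == "__main__":
    print(nonzero_watched(parse_smart_attributes(SAMPLE_TABLE)))
```

Running it once per disk on the output of `smartctl -A /dev/sdX` would show at a glance which drives report CRC errors and which do not.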

Also, I realized that right after booting, these two devices are getting their link speed limited to 3.0 Gbps:

```
Mär 08 00:33:46 gallifrey kernel: ata8.00: exception Emask 0x10 SAct 0xc00 SErr 0x400000 action 0x6 frozen

Mär 08 00:33:46 gallifrey kernel: ata8.00: irq_stat 0x08000000, interface fatal error

Mär 08 00:33:46 gallifrey kernel: ata8: SError: { Handshk }

Mär 08 00:33:46 gallifrey kernel: ata8.00: failed command: WRITE FPDMA QUEUED

Mär 08 00:33:46 gallifrey kernel: ata8.00: cmd 61/00:50:80:d7:68/02:00:07:00:00/40 tag 10 ncq 262144 out

                                            res 40/00:58:a0:83:17/00:00:00:00:00/40 Emask 0x10 (ATA bus error)

Mär 08 00:33:46 gallifrey kernel: ata8.00: status: { DRDY }

Mär 08 00:33:46 gallifrey kernel: ata8.00: failed command: READ FPDMA QUEUED

Mär 08 00:33:46 gallifrey kernel: ata8.00: cmd 60/40:58:a0:83:17/00:00:00:00:00/40 tag 11 ncq 32768 in

                                            res 40/00:58:a0:83:17/00:00:00:00:00/40 Emask 0x10 (ATA bus error)

Mär 08 00:33:46 gallifrey kernel: ata8.00: status: { DRDY }

Mär 08 00:33:46 gallifrey kernel: ata8: hard resetting link

Mär 08 00:33:47 gallifrey kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Mär 08 00:33:47 gallifrey kernel: ata8.00: configured for UDMA/133

Mär 08 00:33:47 gallifrey kernel: ata8: EH complete

Mär 08 00:33:47 gallifrey kernel: ata8.00: exception Emask 0x10 SAct 0xc000 SErr 0x400000 action 0x6 frozen

Mär 08 00:33:47 gallifrey kernel: ata8.00: irq_stat 0x08000000, interface fatal error

Mär 08 00:33:47 gallifrey kernel: ata8: SError: { Handshk }

Mär 08 00:33:47 gallifrey kernel: ata8.00: failed command: READ FPDMA QUEUED

Mär 08 00:33:47 gallifrey kernel: ata8.00: cmd 60/40:70:a0:83:17/00:00:00:00:00/40 tag 14 ncq 32768 in

                                            res 40/00:78:80:d7:68/00:00:07:00:00/40 Emask 0x10 (ATA bus error)

Mär 08 00:33:47 gallifrey kernel: ata8.00: status: { DRDY }

Mär 08 00:33:47 gallifrey kernel: ata8.00: failed command: WRITE FPDMA QUEUED

Mär 08 00:33:47 gallifrey kernel: ata8.00: cmd 61/00:78:80:d7:68/02:00:07:00:00/40 tag 15 ncq 262144 out

                                            res 40/00:78:80:d7:68/00:00:07:00:00/40 Emask 0x10 (ATA bus error)

Mär 08 00:33:47 gallifrey kernel: ata8.00: status: { DRDY }

Mär 08 00:33:47 gallifrey kernel: ata8: hard resetting link

Mär 08 00:33:47 gallifrey kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Mär 08 00:33:47 gallifrey kernel: ata8.00: configured for UDMA/133

Mär 08 00:33:47 gallifrey kernel: ata8: EH complete

Mär 08 00:33:47 gallifrey kernel: ata8.00: exception Emask 0x10 SAct 0x18000000 SErr 0x400000 action 0x6 frozen

Mär 08 00:33:47 gallifrey kernel: ata8.00: irq_stat 0x08000000, interface fatal error

Mär 08 00:33:47 gallifrey kernel: ata8: SError: { Handshk }

Mär 08 00:33:47 gallifrey kernel: ata8.00: failed command: WRITE FPDMA QUEUED

Mär 08 00:33:47 gallifrey kernel: ata8.00: cmd 61/00:d8:80:d7:68/02:00:07:00:00/40 tag 27 ncq 262144 out

                                            res 40/00:e0:a0:83:17/00:00:00:00:00/40 Emask 0x10 (ATA bus error)

Mär 08 00:33:47 gallifrey kernel: ata8.00: status: { DRDY }

Mär 08 00:33:47 gallifrey kernel: ata8.00: failed command: READ FPDMA QUEUED

Mär 08 00:33:47 gallifrey kernel: ata8.00: cmd 60/40:e0:a0:83:17/00:00:00:00:00/40 tag 28 ncq 32768 in

                                            res 40/00:e0:a0:83:17/00:00:00:00:00/40 Emask 0x10 (ATA bus error)

Mär 08 00:33:47 gallifrey kernel: ata8.00: status: { DRDY }

Mär 08 00:33:47 gallifrey kernel: ata8: hard resetting link

Mär 08 00:33:48 gallifrey kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Mär 08 00:33:48 gallifrey kernel: ata8.00: configured for UDMA/133

Mär 08 00:33:48 gallifrey kernel: ata8: EH complete

Mär 08 00:33:53 gallifrey kernel: ata8: limiting SATA link speed to 3.0 Gbps

Mär 08 00:33:53 gallifrey kernel: ata8.00: exception Emask 0x10 SAct 0xc00000 SErr 0x400000 action 0x6 frozen

Mär 08 00:33:53 gallifrey kernel: ata8.00: irq_stat 0x08000000, interface fatal error

Mär 08 00:33:53 gallifrey kernel: ata8: SError: { Handshk }

Mär 08 00:33:53 gallifrey kernel: ata8.00: failed command: WRITE FPDMA QUEUED

Mär 08 00:33:53 gallifrey kernel: ata8.00: cmd 61/00:b0:80:ca:9f/02:00:0b:00:00/40 tag 22 ncq 262144 out

                                            res 40/00:b8:80:19:40/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)

Mär 08 00:33:53 gallifrey kernel: ata8.00: status: { DRDY }

Mär 08 00:33:53 gallifrey kernel: ata8.00: failed command: READ FPDMA QUEUED

Mär 08 00:33:53 gallifrey kernel: ata8.00: cmd 60/00:b8:80:19:40/01:00:0e:00:00/40 tag 23 ncq 131072 in

                                            res 40/00:b8:80:19:40/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)

Mär 08 00:33:53 gallifrey kernel: ata8.00: status: { DRDY }

Mär 08 00:33:53 gallifrey kernel: ata8: hard resetting link

Mär 08 00:33:54 gallifrey kernel: ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 320)

Mär 08 00:33:54 gallifrey kernel: ata8.00: configured for UDMA/133

Mär 08 00:33:54 gallifrey kernel: ata8: EH complete
```

After the link speed was limited, I see no further error messages like these. Could this be what tore up the file system?
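Editor's sketch: to check whether other ports show the same pattern, the libata messages can be counted per port. The Python below is a rough illustration that assumes the exact message wording seen in the log above; the names `PATTERNS`, `count_ata_events` and `SAMPLE_LOG` are hypothetical.

```python
import re
from collections import defaultdict

# One regex per event type, based on the message wording in the log above.
PATTERNS = {
    "fatal_errors": re.compile(r"(ata\d+)\.\d+: irq_stat \S+ interface fatal error"),
    "hard_resets": re.compile(r"(ata\d+): hard resetting link"),
    "speed_limits": re.compile(r"(ata\d+): limiting SATA link speed"),
}


def count_ata_events(log_text):
    """Return {port: {event name: count}} for every port seen in the log."""
    counts = defaultdict(lambda: {name: 0 for name in PATTERNS})
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[match.group(1)][name] += 1
    return dict(counts)


# A few lines copied from the kernel log in this post.
SAMPLE_LOG = """\
Mär 08 00:33:46 gallifrey kernel: ata8.00: irq_stat 0x08000000, interface fatal error
Mär 08 00:33:46 gallifrey kernel: ata8: hard resetting link
Mär 08 00:33:47 gallifrey kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Mär 08 00:33:53 gallifrey kernel: ata8: limiting SATA link speed to 3.0 Gbps
Mär 08 00:33:53 gallifrey kernel: ata8.00: irq_stat 0x08000000, interface fatal error
Mär 08 00:33:53 gallifrey kernel: ata8: hard resetting link
"""

if __name__ == "__main__":
    for port, events in sorted(count_ata_events(SAMPLE_LOG).items()):
        print(port, events)
```

Fed with the full kernel log of a boot, this would show whether the errors are confined to the ports on the add-in controller or also appear on the onboard ports.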

As far as I have read, the transid errors come from incidents that happen between writing the journal and writing the data. But I don't remember a crash recently...

I do have a backup of all the important stuff, but it would be a major PITA to have to rip all those films again, or even to wait until some shows are broadcast again on TV... So, does anyone here have an idea how to repair this?

Some other system information:

```
root@gallifrey ~ # emerge --info

Portage 2.2.27 (python 2.7.11-final-0, default/linux/amd64/13.0/systemd, gcc-5.3.0, glibc-2.22-r2, 4.4.4-gentoo x86_64)

=================================================================

System uname: Linux-4.4.4-gentoo-x86_64-Intel-R-_Xeon-R-_CPU_E3-1231_v3_@_3.40GHz-with-gentoo-2.2

KiB Mem:    32954140 total,  29727752 free

KiB Swap:          0 total,         0 free

Timestamp of repository gentoo: Mon, 07 Mar 2016 22:30:01 +0000

sh bash 4.3_p42-r2

ld GNU ld (Gentoo 2.25.1 p1.1) 2.25.1

distcc 3.2rc1 x86_64-pc-linux-gnu [disabled]

app-shells/bash:          4.3_p42-r2::gentoo

dev-lang/perl:            5.22.1::gentoo

dev-lang/python:          2.7.11-r2::gentoo, 3.4.3-r7::gentoo

dev-util/cmake:           3.4.3::gentoo

dev-util/pkgconfig:       0.29.1::gentoo

sys-apps/baselayout:      2.2::gentoo

sys-apps/openrc:          0.20.5::gentoo

sys-apps/sandbox:         2.10-r1::gentoo

sys-devel/autoconf:       2.13::gentoo, 2.69-r1::gentoo

sys-devel/automake:       1.11.6-r2::gentoo, 1.13.4-r1::gentoo, 1.14.1-r1::gentoo, 1.15-r2::gentoo

sys-devel/binutils:       2.25.1-r1::gentoo

sys-devel/gcc:            4.9.3::gentoo, 5.3.0::gentoo

sys-devel/gcc-config:     1.8-r1::gentoo

sys-devel/libtool:        2.4.6-r2::gentoo

sys-devel/make:           4.1-r1::gentoo

sys-kernel/linux-headers: 4.4::gentoo (virtual/os-headers)

sys-libs/glibc:           2.22-r2::gentoo

Repositories:

~snip~

Installed sets: @picons

ACCEPT_KEYWORDS="amd64 ~amd64"

ACCEPT_LICENSE="* -@EULA"

CBUILD="x86_64-pc-linux-gnu"

CFLAGS="-march=native -O2 -pipe"

CHOST="x86_64-pc-linux-gnu"

CONFIG_PROTECT="/etc /usr/share/gnupg/qualified.txt"

CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/dconf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo"

CXXFLAGS="-march=native -O2 -pipe"

DISTDIR="/usr/portage/distfiles"

EMERGE_DEFAULT_OPTS="--quiet-build=y"

FCFLAGS="-O2 -pipe"

FEATURES="assume-digests binpkg-logs config-protect-if-modified distlocks ebuild-locks fixlafiles merge-sync news parallel-fetch preserve-libs protect-owned sandbox sfperms split-elog strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync xattr"

FFLAGS="-O2 -pipe"

GENTOO_MIRRORS="http://distfiles.gentoo.org"

LANG="de_DE.utf8"

LDFLAGS="-Wl,-O1 -Wl,--as-needed"

MAKEOPTS="-j8"

PKGDIR="/usr/portage/packages"

PORTAGE_CONFIGROOT="/"

PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages --exclude=/.git"

PORTAGE_TMPDIR="/var/tmp"

USE="acl aes amd64 autoipd avahi avx avx2 bazaar berkdb bzip2 cli cracklib crypt cryptsetup cxx dbus djvu dri fftw fma3 fortran gdbm git hdri iconv icu ipv6 jpeg lm_sensors mdnsresponder-compat mercurial mmx mmxext modules multilib ncurses nfs nfsv4 nfsv41 nls nptl openmp pam pcre png popcnt rar readline samba script seccomp session sse sse2 sse3 sse4_1 sse4_2 ssl ssse3 subversion svg systemd tcpd threads tigg udev unicode upnp usbredir webp wmf xattr zeroconf zlib" ABI_X86="64" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" APACHE2_MODULES="authn_core authz_core socache_shmcb unixd actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="kexi words flow plan sheets stage tables krita karbon braindump author" CAMERAS="ptp2" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" CPU_FLAGS_X86="aes avx avx2 fma3 mmx mmxext popcnt sse sse2 sse3 sse4_1 sse4_2 ssse3" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ublox ubx" GRUB_PLATFORMS="efi-64" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LIBREOFFICE_EXTENSIONS="presenter-console presenter-minimizer" LINGUAS="de" OFFICE_IMPLEMENTATION="libreoffice" PHP_TARGETS="php5-5" PYTHON_SINGLE_TARGET="python3_4" PYTHON_TARGETS="python2_7 
python3_4" QEMU_SOFTMMU_TARGETS="x86_64 aarch64 arm" RUBY_TARGETS="ruby20 ruby21" USERLAND="GNU" VIDEO_CARDS="fbdev glint intel mach64 mga nouveau nv r128 radeon savage sis tdfx trident vesa via vmware dummy v4l" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"

USE_PYTHON="2.7"

Unset:  CC, CPPFLAGS, CTARGET, CXX, INSTALL_MASK, LC_ALL, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS
```

----------

## Roman_Gruber

Well I am sorry for your loss.

I hope you are well aware that when you use untested, experimental software (call it junkware), you have to cope with such things.

Even ext4 had issues and went through many changes, even when it was considered "stable" or "usable".

Btrfs looked like alpha software a year ago, the last time I checked it.

I have my reasons for no longer using xfs, jfs, or reiserfs and sticking with good old ext4. I have dealt with many broken ext3 and ext2 partitions in the past, and I know how to handle them. With today's SSDs and drives there is no reason to use anything else; data integrity and security matter more than minimal speed improvements. The only setup really worth using is lvm + luks + ext4 (I have not done an installation without it for over 3 years, on different hardware). Even my external drives use this setup.

--

Regarding your question. You may ask kerneloftruth or frostschutz on this forum.

There are usually some btrfs tools available; you may also ask the developers directly. We are just a support forum, but I am sure that btrfs thingy has some user forum or mailing list where you could ask directly.

The Germans say: Geh zum Schmied und nicht zum Schmiedl. Loosely translated: ask the developers of this "junkware" and not the user base.

----------

## Ant P.

 *andig wrote:*   

> Two of the disks are connected to a PCIe SATA controller (with ASMEDIA ASM1061). These show "UDMA_CRC_Error_Count 2" but this low number should be nothing to worry about, right?
> 
> Also, I realized that right after booting, these two devices are getting their link speed limited to 3.0 Gbps:

 

That controller might be borderline-flaky. The fact that most of your high-level errors are pointing to /dev/sdd may be related; is that disk on those ports?

----------

## andig

Thanks for the quick and harsh but apparently true reply. Luckily, I still have access to nearly everything.

I will post my issue on the BTRFS mailing list. If I get some sort of solution there, it will find its way to this forum.

 *Quote:*   

> The Germans say: Geh zum Schmied und nicht zum Schmiedl. Loosely translated: ask the developers of this "junkware" and not the user base.

 

As a German, I have never even heard of this saying :)

 *Ant P. wrote:*   

>  *andig wrote:*   Two of the disks are connected to a PCIe SATA controller (with ASMEDIA ASM1061). These show "UDMA_CRC_Error_Count 2" but this low number should be nothing to worry about, right?
> 
> Also, I realized that right after booting, these two devices are getting their link speed limited to 3.0 Gbps: 
> 
> That controller might be borderline-flaky. The fact that most of your high-level errors are pointing to /dev/sdd may be related; is that disk on those ports?

 

Is there a good way to find this out?

As far as I can tell from this:

```
# ls -lh /sys/dev/block/*

lrwxrwxrwx 1 root root 0  8. Mär 08:47 /sys/dev/block/8:0 -> ../../devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/

lrwxrwxrwx 1 root root 0  8. Mär 08:47 /sys/dev/block/8:1 -> ../../devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/sda1/

lrwxrwxrwx 1 root root 0  8. Mär 08:47 /sys/dev/block/8:112 -> ../../devices/pci0000:00/0000:00:1c.4/0000:06:00.0/ata8/host7/target7:0:0/7:0:0:0/block/sdh/

lrwxrwxrwx 1 root root 0  8. Mär 08:47 /sys/dev/block/8:16 -> ../../devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sdb/

lrwxrwxrwx 1 root root 0  8. Mär 08:47 /sys/dev/block/8:17 -> ../../devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sdb/sdb1/

lrwxrwxrwx 1 root root 0  8. Mär 08:47 /sys/dev/block/8:18 -> ../../devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sdb/sdb2/

lrwxrwxrwx 1 root root 0  8. Mär 08:47 /sys/dev/block/8:2 -> ../../devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/sda2/

lrwxrwxrwx 1 root root 0  8. Mär 08:47 /sys/dev/block/8:32 -> ../../devices/pci0000:00/0000:00:1f.2/ata3/host2/target2:0:0/2:0:0:0/block/sdc/

lrwxrwxrwx 1 root root 0  8. Mär 08:47 /sys/dev/block/8:48 -> ../../devices/pci0000:00/0000:00:1f.2/ata4/host3/target3:0:0/3:0:0:0/block/sdd/

lrwxrwxrwx 1 root root 0  8. Mär 08:47 /sys/dev/block/8:64 -> ../../devices/pci0000:00/0000:00:1f.2/ata5/host4/target4:0:0/4:0:0:0/block/sde/

lrwxrwxrwx 1 root root 0  8. Mär 08:47 /sys/dev/block/8:80 -> ../../devices/pci0000:00/0000:00:1f.2/ata6/host5/target5:0:0/5:0:0:0/block/sdf/

lrwxrwxrwx 1 root root 0  8. Mär 08:47 /sys/dev/block/8:96 -> ../../devices/pci0000:00/0000:00:1c.4/0000:06:00.0/ata7/host6/target6:0:0/6:0:0:0/block/sdg/

lrwxrwxrwx 1 root root 0  8. Mär 08:47 /sys/dev/block/9:127 -> ../../devices/virtual/block/md127/
```

The devices ata7 (sdg) and ata8 (sdh) are attached to the controller; see this snippet from lspci -tv:

```
+-1c.4-[06]----00.0  ASMedia Technology Inc. ASM1062 Serial ATA Controller
```
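To avoid reading the port mapping off the symlinks by eye, here is a minimal Python sketch that pulls the PCI function, the ATA port and the disk name out of one sysfs link target. The function name `parse_sysfs_target` and the regular expression are my own illustration and assume the path layout shown in the listing above; other kernels or controllers may lay the path out differently.

```python
import re

# Assumed layout: .../<pci function>/ataN/hostN/targetN:0:0/N:0:0:0/block/<disk>/[<partition>/]
_SYSFS_RE = re.compile(
    r"/(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])"
    r"/(?P<ata>ata\d+)/.*/block/(?P<disk>[a-z]+)/?(?P<part>[a-z]+\d+)?/?$"
)

def parse_sysfs_target(target):
    """Return (pci_function, ata_port, disk) for a link target, or None.

    None is returned for devices that do not sit behind an ATA port,
    for example the virtual md127 device.
    """
    match = _SYSFS_RE.search(target)
    if match is None:
        return None
    return match.group("pci"), match.group("ata"), match.group("disk")

if __name__ == "__main__":
    link = ("../../devices/pci0000:00/0000:00:1c.4/0000:06:00.0/"
            "ata8/host7/target7:0:0/7:0:0:0/block/sdh/")
    print(parse_sysfs_target(link))
```

Feeding it the targets of every link under /sys/dev/block (for instance via `os.readlink`) and grouping by the first tuple element shows which disks share a controller.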

And judging from the current progress of badblocks, the device itself seems OK...

```
root@gallifrey ~ # badblocks -nsv -b 4096 /dev/sdd
Checking for bad blocks in non-destructive read-write mode
From block 0 to 1220942645
Checking for bad blocks (non-destructive read-write test)
Testing with random pattern:  92.85% done, 32:06:49 elapsed. (0/0/0 errors)
```
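As a sanity check on those numbers, here is a small Python sketch that converts the last block number reported by badblocks (1220942645, at the 4096-byte block size given with `-b`) into a device size, and extrapolates the remaining run time from the progress line. The helper names are made up for illustration, and the time estimate assumes the scan speed stays constant, which is optimistic since disks get slower towards the inner tracks.

```python
def device_bytes(last_block, block_size):
    """Bytes covered by a scan from block 0 to last_block inclusive."""
    return (last_block + 1) * block_size

def remaining_seconds(elapsed, percent_done):
    """Estimate the remaining time, assuming a constant scan speed."""
    hours, minutes, seconds = (int(part) for part in elapsed.split(":"))
    elapsed_s = hours * 3600 + minutes * 60 + seconds
    total_s = elapsed_s / (percent_done / 100.0)
    return total_s - elapsed_s

if __name__ == "__main__":
    # 5000981078016 bytes, i.e. the full 5 TB disk is being scanned
    print(device_bytes(1220942645, 4096))
    # rough number of hours still to go
    print(remaining_seconds("32:06:49", 92.85) / 3600)
```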

----------

## frostschutz

 *andig wrote:*   

> Two of the disks are connected to a PCIe SATA controller (With ASMEDIA ASM1061).

 

I'm using three controllers with that chipset, and the cards were dead cheap (10€ apiece). Each card has 2 eSATA and 2 internal SATA ports, with a set of 4 jumpers so you can use either the external or the internal port... However, the SATA cable does not snap onto the internal ports, so it can come loose easily if you move your PC around a lot.

Even so, they work fine for me and I would use them again. I think you got either a bad cable, a loose cable, or a bad card.

The only failure with this controller is with some ATAPI commands: it needs libata.atapi_passthru16=0 to work with optical drives without crashing.

Regarding the btrfs issue I'm not able to help, but as soon as you get a kernel panic it should probably go to the devs / mailing list anyhow.

----------

## Roman_Gruber

Regarding ASMedia controllers, I would stay away from those. I have read awful comments from several users about ASMedia chips in external HDD enclosures on a popular sales page. Reportedly they have issues with HDD encodings and with how the SATA bridge is implemented.

----------

## andig

The computer barely gets moved. I checked the wiring of the SATA ports on the PCIe card after I noticed those problems.

Out of interest, do you also have those ATA errors in your kernel log?

So for now it is one vote for and one against ASMedia...

tw04l124, any recommendations on an affordable PCIe SATA controller with at least 2 ports?

As a side note, the 3 older disks previously held a BTRFS file system too (but with raid5), and it worked for over a year. When I bought the 3 additional disks, I wiped the old ones and recreated the complete array as raid6 with 6 disks.

----------

## frostschutz

 *andig wrote:*   

> Out of interest, do you also have those ATA errors in your kernel log?

 

Nope, no errors whatsoever...

The same chip is also built into many mainboards that offer 2 additional SATA ports (ASUS etc.).

----------

## andig

Okay, then I will try out new cables when I'm home this evening.

I will also post this issue on the BTRFS mailing list.

----------

## petan

 *tw04l124 wrote:*   

> Well I am sorry for your loss.
> 
> I hope you are well aware when you use an untested, experimental software (call it junkware), you have to cope with such stuff.
> 
> Even ext4 had issues and had many changes, even when it was considered "stable" or "useable".
> ...

 

I have been using BTRFS for years, even on production servers, and never had a problem with it. Every FS can get corrupted beyond recovery, which is what backups are for. Even ext4 can easily get fucked up in a way you cannot recover from, no matter how stable it is.

This could very well be related to failed hardware, so I don't see any reason why you should blame the software.

As for why you should use btrfs over the obsolete and frankly quite ancient ext4, here are its features:

* Online shrinking of FS

* Online filesystem checks

* Snapshotting and subvolumes

* Data integrity verification

* Transparent compression

Normal users may not need or use any of these features, but there are probably quite a number of advanced users who can't live without them. And this is where ext4 inherently fails.
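To illustrate what "data integrity verification" means in the list above, here is a minimal Python sketch: a checksum is stored alongside every block when it is written and recomputed when it is read, so silent corruption shows up as a mismatch. This is not the actual btrfs implementation. btrfs uses CRC32C by default, while `zlib.crc32` (a different polynomial) merely stands in here, and `BLOCK_SIZE` and the function names are made up for the example.

```python
import zlib

BLOCK_SIZE = 4096  # illustrative block size, not read from any real file system

def write_blocks(data):
    """Split data into blocks and store a checksum next to each block."""
    blocks = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        blocks.append((block, zlib.crc32(block)))
    return blocks

def find_corrupt_blocks(stored):
    """Return the indices of blocks whose content no longer matches its checksum."""
    return [index for index, (block, checksum) in enumerate(stored)
            if zlib.crc32(block) != checksum]

if __name__ == "__main__":
    stored = write_blocks(b"a" * 10000)
    # Simulate a flipped byte on disk in the second block.
    damaged, checksum = stored[1]
    stored[1] = (b"b" + damaged[1:], checksum)
    print(find_corrupt_blocks(stored))
```

With redundant profiles such as raid1 or raid6, a block that fails its check can be rebuilt from the other copies or from parity, whereas a checksum on its own can only detect the damage.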

----------

## Atmmac

 *petan wrote:*   

>  *tw04l124 wrote:*   Well I am sorry for your loss.
> 
> I hope you are well aware when you use an untested, experimental software (call it junkware), you have to cope with such stuff.
> 
> Even ext4 had issues and had many changes, even when it was considered "stable" or "useable".
> ...

 

BTRFS in production is ballsy..

----------

## Roman_Gruber

@petan

I think I have written this already, but I may be wrong:

LVM + LUKS + ext4.

In production: moving the running system to a new hard disk (already done several times).

LVM should support snapshotting; others have used it, I have not. I can afford the luxury of moving the data with a live disc.

Encryption (no support in btrfs). So what do you do with disks where the controller board dies and your data is readable because it was never encrypted ...

ext4 (good base). ext2 worked well for as long as I had it. ext3 was buggy because power losses were never considered. ext4 is a bugfix for those ext3 power-loss bugs.

I admit I am a bit of a fool to have ext4 on my backups and ext4 on my production systems. That's foolish and I will take it into account in my next backup iteration.

--

LVM is well documented by Red Hat and has features which the majority of file systems lack.

LUKS is essential when you want to throw away hard disks without physically damaging them or destroying the data by other means, which I doubt many people do.

ext4 may be one of many choices, but it is one of the better ones.

--

* Online shrinking of FS

Except XFS (I think), any decent FS supports shrinking. Online shrinking may cause issues and I doubt it really works.

* Online filesystem checks

I wonder if this really works or if these are just features on paper.

When I look at Windows 95 and later, they could not manage that, so I doubt the open source world can manage it. It is too complicated.

* Data integrity verification

Creating a checksum for data blocks is essential and is in any file system, even in NTFS, and even in TCP/IP.

That only means recalculating the checksum of a data block and then comparing it against the written value. Nothing special.

I am not a programmer, but every network node, for example, does this on every hop and discards whatever fails the check. I assume the same is done by any computer which reads a file.

* Transparent compression 

A trade-off of CPU cycles for performance. When it is needed, okay. With today's SSDs, the weakest point is always the CPU bottleneck.

A decent file system performs well and does not need too much juice from the CPU.

* Snapshoting and subvolumes

LVM works well and is well tested. One of the better choices.

Not a word about mirroring or any of the other features which LVM has supported for ages.

No useful abstraction layer that lets you move your data regardless of the underlying hardware. LVM calls them physical extents. Creating bigger disks, moving extents around as you need them ...

There is a reason why LVM exists, and it does its job well. Putting everything in a file system is nuts. I look at the systemd junk, they also put everything in there, but whether that is smart or not is an extra topic; everyone has an opinion on that.

In the Unix world you usually have smaller tools which you put together, not one big tool which implements everything.

---

LVM + LUKS + ext4 (well tested)

btrfs (playing with fire, may work or may not work. Nice when you do not care about losing the data since your last backup)

I replace my hard drives every 2 years or sooner, and I use them heavily, even SSDs, on a daily basis. No major issues for many years, although ext4 has had some regressions.

----------

## andig

Now with swapped cables, I at least have no more ATA messages in the kernel log. I have looked through some recent logs and it seems like those messages appeared after every boot.

It seems like the cables labeled "Serial ATA 3.0" were not good enough...

----------

