# [SOLVED] Secondary SATA HD stalls during video playback

## VinzC

Hi all.

I have installed a secondary SATA hard drive in my machine, and ever since, I've noticed that while playing video files from it the video suddenly stalls for a short while (one or two seconds) and then goes on. I initially thought it was a hardware issue, but I see no errors in the kernel log. It's not a bad block in the file either, because rewinding to just before the stall plays the same section normally. These "pauses" occur at random.

I still don't know whether it's a hardware or a kernel-related issue. All I can say is that I don't experience it when I play the same video from my main disk, i.e. the one where Gentoo is installed, with swap, home, everything. The machine stays responsive throughout; the problem only shows during video playback.

Does anybody have an idea how I can nail down this issue?

FYI: Current kernel is 3.10.7, HD model is ST31000528AS.

----------

## aCOSwt

What does `hdparm --direct -tT /dev/sd??` tell on both filesystems?

What does `hdparm -a /dev/sd?` tell on both devices?

What does `dumpe2fs -h /dev/sd??` tell on both filesystems?

Well... tell us a little bit more!
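For reference, the three checks can be run in one go. A minimal sketch, assuming the device and partition names from this thread (`/dev/sda`, `/dev/sdb`, `/dev/sda3`, `/dev/sdb1`); adjust them to your setup and run as root:

```shell
#!/bin/sh
# Run the diagnostics above over both drives, skipping anything absent.
if command -v hdparm >/dev/null 2>&1; then
    for dev in /dev/sda /dev/sdb; do
        [ -b "$dev" ] || continue        # skip non-existent block devices
        hdparm --direct -tT "$dev"       # O_DIRECT cached and raw read timing
        hdparm -a "$dev"                 # current read-ahead setting
    done
fi
if command -v dumpe2fs >/dev/null 2>&1; then
    for part in /dev/sda3 /dev/sdb1; do
        [ -b "$part" ] || continue
        dumpe2fs -h "$part"              # ext2/3/4 superblock summary only
    done
fi
ok=yes                                   # reached the end without aborting
```

The `command -v` and `[ -b ]` guards just keep the script harmless on machines where the tools or devices are missing.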

----------

## VinzC

 *aCOSwt wrote:*   

> What does `hdparm --direct -tT /dev/sd??` tell on both filesystems?
> 
> What does `hdparm -a /dev/sd?` tell on both devices?
> 
> What does `dumpe2fs -h /dev/sd??` tell on both filesystems?
> ...

 

Thanks for the hints, aCOSwt.

Here you go! (Note: read-ahead is on for both drives.)

```
/dev/sda3:

 Timing O_DIRECT cached reads:   464 MB in  2.00 seconds = 231.66 MB/sec

 Timing O_DIRECT disk reads: 200 MB in  1.72 seconds = 116.29 MB/sec
```

```

dumpe2fs 1.42 (29-Nov-2011)

Filesystem volume name:   boot

Last mounted on:          <not available>

Filesystem UUID:          72627d6a-eafb-4d94-8dc6-d42b07f0c869

Filesystem magic number:  0xEF53

Filesystem revision #:    1 (dynamic)

Filesystem features:      has_journal ext_attr resize_inode dir_index filetype sparse_super

Filesystem flags:         signed_directory_hash 

Default mount options:    (none)

Filesystem state:         clean

Errors behavior:          Continue

Filesystem OS type:       Linux

Inode count:              51200

Block count:              204800

Reserved block count:     10240

Free blocks:              183178

Free inodes:              51168

First block:              1

Block size:               1024

Fragment size:            1024

Reserved GDT blocks:      256

Blocks per group:         8192

Fragments per group:      8192

Inodes per group:         2048

Inode blocks per group:   256

Filesystem created:       Sat Mar 24 11:09:09 2012

Last mount time:          Fri Feb  7 00:33:14 2014

Last write time:          Mon Mar 10 20:23:38 2014

Mount count:              0

Maximum mount count:      39

Last checked:             Mon Mar 10 20:23:38 2014

Check interval:           15552000 (6 months)

Next check after:         Sat Sep  6 21:23:38 2014

Reserved blocks uid:      0 (user root)

Reserved blocks gid:      0 (group root)

First inode:              11

Inode size:             128

Journal inode:            8

Default directory hash:   half_md4

Directory Hash Seed:      e84fb412-9414-4543-b635-1b2ecebed5b1

Journal backup:           inode blocks

Journal features:         journal_incompat_revoke

Journal size:             4113k

Journal length:           4096

Journal sequence:         0x0000017a

Journal start:            0
```

Now about the «lagging» disk:

```
/dev/sdb1:

 Timing O_DIRECT cached reads:   472 MB in  2.00 seconds = 235.54 MB/sec

 Timing O_DIRECT disk reads: 358 MB in  3.01 seconds = 119.12 MB/sec
```

```

dumpe2fs 1.42 (29-Nov-2011)

Filesystem volume name:   media

Last mounted on:          /media/media

Filesystem UUID:          c3e71afc-dafa-4f5d-ad11-652724a25568

Filesystem magic number:  0xEF53

Filesystem revision #:    1 (dynamic)

Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize

Filesystem flags:         signed_directory_hash 

Default mount options:    user_xattr acl

Filesystem state:         clean

Errors behavior:          Continue

Filesystem OS type:       Linux

Inode count:              52428800

Block count:              209715200

Reserved block count:     10485760

Free blocks:              67302646

Free inodes:              52421096

First block:              0

Block size:               4096

Fragment size:            4096

Reserved GDT blocks:      974

Blocks per group:         32768

Fragments per group:      32768

Inodes per group:         8192

Inode blocks per group:   512

Flex block group size:    16

Filesystem created:       Wed Jan  8 16:34:20 2014

Last mount time:          Tue Mar 25 18:14:01 2014

Last write time:          Tue Mar 25 18:14:01 2014

Mount count:              55

Maximum mount count:      -1

Last checked:             Fri Jan 10 09:55:36 2014

Check interval:           0 (<none>)

Lifetime writes:          484 GB

Reserved blocks uid:      0 (user root)

Reserved blocks gid:      0 (group root)

First inode:              11

Inode size:             256

Required extra isize:     28

Desired extra isize:      28

Journal inode:            8

Default directory hash:   half_md4

Directory Hash Seed:      52a5955f-192b-42a5-937b-358cae79edc2

Journal backup:           inode blocks

Journal features:         journal_incompat_revoke

Journal size:             128M

Journal length:           32768

Journal sequence:         0x0000167a

Journal start:            1
```

Hope this helps. One last thing: I use LVM on sda but not on sdb. The filesystem on sda3 (/boot) is ext3. Don't know if that matters though.

----------

## aCOSwt

- OK, from hdparm's report I conclude that both drives have equivalent throughput => a hard-disk hardware problem is unlikely.

- What is surprising is that the "lagging" filesystem has a 4K block size, which should theoretically offer much better throughput than your 1K-block-size boot filesystem.   :Confused: 

- If we were tracking µs, I would say that an inode size of 256 plus the extra_isize feature doesn't help it compete with your boot filesystem (inode size = 128), but... we are not tracking µs yet.

=> A filesystem-parameters problem is unlikely.

Up one level :

What do `cat /sys/block/sd?/queue/scheduler` and `cat /sys/block/sd?/queue/nomerges` tell for both devices?

----------

## VinzC

 *aCOSwt wrote:*   

> What do `cat /sys/block/sd?/queue/scheduler` and `cat /sys/block/sd?/queue/nomerges` tell for both devices?

 

Here are the readings:

```
noop [deadline] cfq bfq 

noop [deadline] cfq bfq
```

```
0

0
```

I already feel relieved that it doesn't look like a hardware issue. Well, there's always a "but"  :Very Happy:  .

EDIT: I've just checked smartctl and it reports that a firmware update might be available for the lagging drive; maybe that's a hypothesis worth investigating?

```
...

==> WARNING: A firmware update for this drive may be available,

see the following Seagate web pages:

http://knowledge.seagate.com/articles/en_US/FAQ/207931en

http://knowledge.seagate.com/articles/en_US/FAQ/213891en

...
```

----------

## Anon-E-moose

You might try using either cfq or bfq as default instead of deadline and see if that has an effect.

----------

## VinzC

 *Anon-E-moose wrote:*   

> You might try using either cfq or bfq as default instead of deadline and see if that has an effect.

 

Thanks for the hint, too. I guess I can change this without rebooting or recompiling, right?

EDIT: Found http://www.linuxhowtos.org/System/iosched.htm . Will try and report back, but not before 1-2 days.
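For the record, the runtime switch is a one-line write to sysfs (root only, no reboot or recompile). A sketch, assuming the device name `sdb` from this thread; the helper just pulls the bracketed (active) name out of the sysfs line:

```shell
#!/bin/sh
# The scheduler file lists every compiled-in scheduler and marks the
# active one in brackets, e.g. "noop [deadline] cfq bfq".
# This helper extracts the bracketed name from such a line (read on stdin).
active_sched() {
    sed -n 's/.*\[\([a-z_-]*\)\].*/\1/p'
}

DEV=sdb                                   # example device; adjust
SCHED=/sys/block/$DEV/queue/scheduler
if [ -w "$SCHED" ]; then                  # writable only as root
    echo cfq > "$SCHED"                   # takes effect immediately
    active_sched < "$SCHED"               # the brackets should move to cfq
fi
```

The `[ -w ]` guard makes the hardware-touching part a no-op when run unprivileged or on a machine without that device.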

----------

## Anon-E-moose

According to this, yes.

http://www.admon.org/system-tuning/how-to-change-default-io-scheduler/

It's a five-year-old article, but it should still be accurate as far as schedulers go.

I would assume you need to be root to do it.

Edit to add: If it works and you want it permanent, then either recompile the kernel with whichever one you want set as the default, or use the kernel parameter the article mentions, though I haven't checked whether that parameter is still there.
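The parameter in question is `elevator=`, which 3.x kernels still honour at boot. A minimal sketch of setting it via GRUB; the file paths are the common defaults and may differ on your system:

```shell
# /etc/default/grub (GRUB2): append elevator= to the kernel command line.
# For legacy GRUB, add it to the kernel line in /boot/grub/grub.conf instead.
GRUB_CMDLINE_LINUX="elevator=cfq"

# Then regenerate the config (GRUB2), as root:
#   grub-mkconfig -o /boot/grub/grub.cfg
```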

----------

## ulenrich

 *Anon-E-moose wrote:*   

> You might try using either cfq or bfq as default instead of deadline and see if that has an effect.

 Why the hell patch the kernel with bfq in the first place if he uses deadline? It is an intrusive patch ...

----------

## Anon-E-moose

 *ulenrich wrote:*   

>  *Anon-E-moose wrote:*   You might try using either cfq or bfq as default instead of deadline and see if that has an effect.  Why the hell patching the kernel with bfq in the first place if he uses deadline. It is an intrusive patch ...

 

Well, it's their choice, so I suppose they have their reasons.

I personally run bfq and have had 0 problems with it.

As far as intrusive patches and software...I'll save it for another thread.   :Wink: 

----------

## ulenrich

I think gentoo-sources does include the bfq patch: Perhaps there has to be a .config policy forcing bfq then ...

@VinzC, if your system runs well with bfq this is worth mentioning at https://bugs.gentoo.org/

----------

## Anon-E-moose

I've been using zen sources, and they include bfq.

Within the config section of the kernel, General Setup -> CPU Scheduler whichever one you select becomes the default.

Edit to add: The above was for the cpu scheduler   :Embarassed: 

IO scheduler is under "Enable the block layer -> IO Schedulers"

----------

## VinzC

Back after some testing. 

 *Anon-E-moose wrote:*   

> You might try using either cfq or bfq as default instead of deadline and see if that has an effect.

  *ulenrich wrote:*   

>  Why the hell patching the kernel with bfq in the first place if he uses deadline. It is an intrusive patch ...

 

First off: bfq, cfq, as, deadline... consider that's all Chinese to me. Being no hardware specialist, I nevertheless dug in and actually read the kernel documentation about these schedulers. It was *my* personal choice to select deadline, as I (thought I understood that I) wanted performance. I didn't (and still don't) worry about why bfq is there while I chose deadline. I made an educated guess. Or not. But who cares.

I've been using deadline for quite some time, and it's only now that I have a secondary disk that the issue appeared. Maybe deadline doesn't play well when more than one disk uses that scheduler. I don't know. Don't care either. I'll leave that question to the specialists.

In the end, I read the (F***) documentation about those schedulers, which is what every good Linux user is meant to do, right  :Very Happy:  ? So I came up with a slight change: I left sdb (media disk) on deadline and set sda (main disk) on cfq. Why? Because, if I understood correctly, deadline emphasizes punctuality while the others emphasize fairness. On my media disk I want the smallest latency, while my main disk can wait a little: I can bear a small latency in daily desktop usage better than I can bear video lag. So far, no lag observed.
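A hypothetical udev rule can make that per-disk split survive reboots. The file name and the rule syntax below are my assumption, not something posted in this thread; the `KERNEL==` matchers must be adapted to your actual devices:

```
# /etc/udev/rules.d/60-iosched.rules (hypothetical file name)
# Re-applies the per-disk scheduler every time the device (re)appears:
ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="cfq"
ACTION=="add|change", KERNEL=="sdb", ATTR{queue/scheduler}="deadline"
```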

Anyway thanks a lot guys for your help and precious hints.

Still testing.

EDIT: Oh, and yes, Gentoo Sources come with BFQ patch.

----------

## Anon-E-moose

I have cfq and bfq compiled in the kernel, but I get excellent performance with bfq so I just leave it at that.

It is just a modification with a little tweaking of cfq though.

Hope it all works out.

----------

## VinzC

 *Anon-E-moose wrote:*   

> I have cfq and bfq compiled in the kernel, but I get excellent performance with bfq so I just leave it at that.
> 
> It is just a modification with a little tweaking of cfq though.
> 
> Hope it all works out.

 

I'll certainly go further than just deadline. So far it's more of a self-check to see whether my understanding is correct; I'm already so glad Linux has let me get this far in learning about the system. I'm also curious how deadline compares with cfq and bfq. Not that I care that much about the technical aspects, but I'll see whether I get lagging and which combination(s) give(s) me what I want.

----------

## ulenrich

Yes, please try adding to your grub cmdline:

```
elevator=    [IOSCHED]
        Format: {"cfq" | "deadline" | "noop"}
```

and of course bfq. For details on each of them, look at the files:

```
/usr/src/linux/Documentation/block/cfq-iosched.txt
```

... and the like: there are many things you can tune.

----------

## Anon-E-moose

VinzC, do keep us informed of how your tests go.

I'm sure the info will be helpful to many.

Edit to add: For a brief intro to bfq. http://algo.ing.unimo.it/people/paolo/disk_sched/

----------

## VinzC

Back again after some more testing.

Did 2 more tests:

1. bfq for both drives
2. bfq for sda (main) and deadline for sdb (media)

With configuration 1 I got random slight glitches, very short but perceptible, i.e. milliseconds maybe; in short, no very long pauses. With configuration 2 everything ran smoothly and I didn't get the slightest glitch: video played perfectly.

So my tentative conclusion is that deadline fits best for video storage *when* the other disks use ?fq schedulers. From my observations, there should be only one disk running the deadline scheduler. YMMV though. With no specific constraint, ?fq on all disks is a wise choice. With only one disk, deadline gives better [i.e. more stable] throughput.

In short:  *Quote:*   

> if you have multiple disks, use deadline on the [one] disk that needs stable throughput and the lowest latency, like video-streaming use cases, especially HD. Otherwise set one of [cb]fq globally.

 

The context, of course, is gentoo-sources-3.10.*. I haven't tried a more recent kernel nor other kernel sources. Yet...

----------

## Anon-E-moose

Thanks, that's impressive.

----------

## VinzC

 *Anon-E-moose wrote:*   

> Thanks, that's impressive.

 

It's a common... «detail» I've often seen in benchmarks, in fact. They mainly focus on throughput and performance, but that's really just one aspect of a filesystem's behaviour. Latency (from what I've seen, and I've seen not that much, to be honest) doesn't get much attention. Yet latency, along with throughput stability, has a non-negligible impact on user experience, although only in specific contexts, and you cannot measure everything in all cases. Maximizing throughput is important when transferring files, since small pauses really do not matter there; it's a totally different thing when watching a video, for instance. I'm glad I went through this anyway, as it's been quite instructive, thanks to all of your hints, guys.

To mods: wouldn't this topic be a good candidate for sticky-ness  :Wink:  ?

----------

## aCOSwt

 *VinzC wrote:*   

> To mods: wouldn't this topic be a good candidate for sticky-ness  ?

 

Not yet!

First because, having pointed everyone in a direction on which they all embarked, I am still not convinzc that the root cause has actually been found.

----------

## Anon-E-moose

Stickies aren't always made because a problem is solved, but because a discussion or the points brought out have some benefit to other users in general, or because other views/help may show up. IMO.

----------

## VinzC

 *Anon-E-moose wrote:*   

> Stickies aren't always made because a problem is solved, but because a discussion or the points brought out have some benefit to other users in general, or because other views/help may show up. IMO.

 

No problemo.

Anyway, I spoke too soon: the problem is back. The testing phase probably wasn't thorough enough, and the issue occurred again after a couple of hours watching videos from the same disk. I've installed smartmontools just in case. So my conclusions might be valid in some cases, but not this one  :Very Happy:  .

----------

## VinzC

I think the problem is finally fixed. I upgraded the second disk's firmware, and now the problem seems to be gone for good. I played video files for a few hours without a hiccup. Running smartctl against the drive had indeed shown it needed a firmware upgrade, as I wrote earlier; I just followed the instructions behind the links.

I had also noticed that smartctl was slower delivering its information when querying the second hard drive, by 1-2 seconds in fact. Now `smartctl -a /dev/sdb` responds immediately. The firmware version was <something>42; now it's CC49.
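For anyone checking their own drive: the firmware string comes from smartctl's identity section. A small sketch that parses it out; the device name is an example from this thread:

```shell
#!/bin/sh
# Pull the firmware string out of `smartctl -i` output (read on stdin).
# "Firmware Version:" is the standard field label smartctl prints.
fw_version() {
    sed -n 's/^Firmware Version:[[:space:]]*//p'
}

if command -v smartctl >/dev/null 2>&1; then
    smartctl -i /dev/sdb | fw_version    # prints e.g. CC49 after the upgrade
fi
```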

 *VinzC wrote:*   

> To mods: wouldn't this topic be a good candidate for sticky-ness  ?

 

 *aCOSwt wrote:*   

> Not yet!
> 
> First because, having pointed everyone in a direction on which they all embarked, I am still not convinzc that the root cause has actually been found.

 

LOL... sorry, I've just noticed the pun  :Very Happy:  . Nice one.

----------

