# [SOLVED] kernel 4.12: cpu stall with dm-raid

## ecko

Hello, since I upgraded from 4.11 to 4.12, I get CPU stalls at random moments (the system is a desktop used for office work, mostly idle). During a stall, I/O is frozen (including the SATA disks and the USB mouse, though the PS/2 keyboard is fine); programs already in memory stay responsive as long as they don't need I/O. `top` reports the md_raid thread occupying 100% of a core (/home is a RAID1 array using the Linux kernel's software RAID), while iotop reports no particular I/O activity.

What can I do?

dmesg below (running gentoo-sources-4.12.4)

```

[  249.148386] INFO: rcu_sched self-detected stall on CPU

[  249.148390]  0-...: (2099 ticks this GP) idle=27e/140000000000001/0 softirq=4532/4533 fqs=1049 

[  249.148390]   (t=2100 jiffies g=2467 c=2466 q=27)

[  249.148392] NMI backtrace for cpu 0

[  249.148393] CPU: 0 PID: 3162 Comm: md0_raid1 Not tainted 4.12.4-gentoo #1

[  249.148394] Hardware name: System manufacturer System Product Name/P8P67 PRO, BIOS 1253 01/20/2011

[  249.148394] Call Trace:

[  249.148396]  <IRQ>

[  249.148399]  dump_stack+0x4d/0x67

[  249.148401]  nmi_cpu_backtrace+0x95/0xa0

[  249.148411]  ? irq_force_complete_move+0xe0/0xe0

[  249.148412]  nmi_trigger_cpumask_backtrace+0x91/0xc0

[  249.148413]  arch_trigger_cpumask_backtrace+0x14/0x20

[  249.148415]  rcu_dump_cpu_stacks+0x93/0xce

[  249.148417]  rcu_check_callbacks+0x767/0x8b0

[  249.148419]  ? tick_sched_handle.isra.7+0x30/0x30

[  249.148420]  update_process_times+0x2a/0x50

[  249.148421]  tick_sched_handle.isra.7+0x29/0x30

[  249.148422]  tick_sched_timer+0x3d/0x70

[  249.148423]  __hrtimer_run_queues+0xda/0x210

[  249.148424]  hrtimer_interrupt+0xac/0x1f0

[  249.148426]  local_apic_timer_interrupt+0x33/0x50

[  249.148427]  smp_apic_timer_interrupt+0x33/0x50

[  249.148429]  apic_timer_interrupt+0x86/0x90

[  249.148430] RIP: 0010:mutex_lock+0x10/0x30

[  249.148430] RSP: 0018:ffffc900004efd58 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10

[  249.148431] RAX: 0000000000000000 RBX: ffff88039da3c000 RCX: ffff88039d482400

[  249.148432] RDX: ffff88039d992cc0 RSI: 0000000000000000 RDI: ffff88039da3c368

[  249.148432] RBP: ffffc900004efd98 R08: 0000000000000000 R09: 0000000000000000

[  249.148433] R10: ffffc900004efeb0 R11: 0000000000000000 R12: ffff88039da3c000

[  249.148433] R13: ffff88039d482428 R14: ffff88039d992cc0 R15: 0000000000000000

[  249.148434]  </IRQ>

[  249.148436]  ? bitmap_daemon_work+0x27/0x340

[  249.148438]  md_check_recovery+0x22/0x460

[  249.148440]  raid1d+0x4c/0x900 [raid1]

[  249.148442]  md_thread+0x115/0x140

[  249.148442]  ? md_thread+0x115/0x140

[  249.148444]  ? wake_atomic_t_function+0x60/0x60

[  249.148445]  kthread+0x104/0x140

[  249.148446]  ? md_register_thread+0xe0/0xe0

[  249.148447]  ? kthread_create_on_node+0x40/0x40

[  249.148448]  ret_from_fork+0x22/0x30

```
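
The header of the stall report gives the stall duration in jiffies (`t=2100 jiffies`). A minimal sketch for turning that into seconds follows. The tick rate `hz` is an assumption: it depends on the kernel's `CONFIG_HZ` build setting, which the log does not show, and the helper name is made up for illustration.

```python
import re

def stall_seconds(report_line: str, hz: int) -> float:
    """Convert the 't=<n> jiffies' field of an RCU stall report to seconds.

    hz is the kernel tick rate (CONFIG_HZ). It is not part of the log,
    so the caller has to supply it (check the kernel .config).
    """
    match = re.search(r"\bt=(\d+) jiffies\b", report_line)
    if match is None:
        raise ValueError("no 't=<n> jiffies' field found")
    return int(match.group(1)) / hz

line = "[  249.148390]   (t=2100 jiffies g=2467 c=2466 q=27)"
print(stall_seconds(line, hz=100))   # assuming CONFIG_HZ=100
print(stall_seconds(line, hz=1000))  # assuming CONFIG_HZ=1000
```

Under the first assumption the report corresponds to 21 seconds without progress, under the second to 2.1 seconds, so the real figure has to be read against the running kernel's configuration.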

Last edited by ecko on Wed Sep 13, 2017 8:34 am; edited 1 time in total

----------

## LIsLinuxIsSogood

If I were you (and I'm not)... have you tried booting into single-user mode without the /home partition mounted? If you can run the operating system without relying on the second disk (the mirror), you may be able to tell whether the problem is related to the RAID features newly added to the kernel, which are listed here (https://fossbytes.com/linux-kernel-4-12-download-features/).

It is a shot in the dark, but since all RAID features rely on two or more disks, perhaps there is a related bug. And if the problem does go away after you detach the mirror, you might be able to add it back afterwards without trouble.

Any luck?

----------

## ecko

 *LIsLinuxIsSogood wrote:*   

> have you tried booting into single user mode without the /home partition mounted?

 

Thanks for the suggestion. I rebooted with /home unmounted (added the noauto option in fstab) and left the machine at the X login screen for 10 hours overnight; no problem occurred. After I mounted /home and logged in, the problem appeared after 3 hours.

(To make sure, I will repeat the test overnight, when the machine is completely idle.) The test was done with 4.12.5 (released 2 days ago, with 3 commits related to RAID).

I just noticed in the logs that the stall is often (but not always) followed, exactly 30 seconds later, by complaints about the clocksource:

```

clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:

clocksource:                       'hpet' wd_now: c3882729 wd_last: 3045c21d mask: ffffffff

clocksource:                       'tsc' cs_now: 9788d643a734 cs_last: 96ffcb8fa416 mask: ffffffffffffffff

sched_clock: Marking unstable (48823520000829, 1115764495)<-(48824723369567, -87604243)

tsc: Marking TSC unstable due to clocksource watchdog

clocksource: Switched to clocksource hpet

```

----------

## radio_flyer

You're not running KDE are you? If so, Baloo will hang I/O hard for that long.

----------

## ecko

 *radio_flyer wrote:*   

> You're not running KDE are you? If so, Baloo will hang I/O hard for that long.

 

I use a simple fluxbox setup, and baloo is not installed. I use app-misc/recoll as the indexer; it updates from a cron job at a known time of day that does not correlate with the observed problem. Also, iotop reports no I/O activity during the problem, so I suspect an I/O lockup caused by a bug in the Linux RAID code. I am now bisecting the kernel. The problem sometimes only shows up after 1 day of uptime, so I will need one more week to go through the remaining 10 bisection steps.
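
As a rough cross-check on the step count: bisection halves the set of candidate commits at every step, so 10 remaining steps correspond to something on the order of 2^10 = 1024 commits still in play. A minimal sketch of that arithmetic follows; the function name is illustrative, and `git bisect` itself may report one step more or less depending on the shape of the history.

```python
def bisect_steps(candidates: int) -> int:
    """Worst-case number of halving steps to narrow `candidates` commits down to one."""
    if candidates < 1:
        raise ValueError("need at least one candidate commit")
    # ceil(log2(candidates)), computed with integers only to avoid float rounding
    return (candidates - 1).bit_length()

print(bisect_steps(1024))  # a range of 1024 commits
print(bisect_steps(1025))  # one commit more needs an extra step
```

Each step costs a full build plus however long it takes for the stall to show up (or convincingly not show up), which is why an intermittent bug makes the bisection slow even though the step count is small.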

----------

## snIP3r

hi all!

i have a similar issue:

```

Aug 21 18:45:28 area52 kernel: INFO: rcu_sched self-detected stall on CPU

Aug 21 18:45:28 area52 kernel: \x090-...: (2099 ticks this GP) idle=53a/140000000000001/0 softirq=871134/871134 fqs=1049

Aug 21 18:45:28 area52 kernel: \x09 (t=2100 jiffies g=777122 c=777121 q=140)

Aug 21 18:45:28 area52 kernel: NMI backtrace for cpu 0

Aug 21 18:45:28 area52 kernel: CPU: 0 PID: 2480 Comm: md127_raid1 Not tainted 4.12.5-gentoo #1

Aug 21 18:45:28 area52 kernel: Hardware name: ASUSTeK COMPUTER INC. P9D-X Series/P9D-X Series, BIOS 0704 03/28/2014

Aug 21 18:45:28 area52 kernel: Call Trace:

Aug 21 18:45:28 area52 kernel:  <IRQ>

Aug 21 18:45:28 area52 kernel:  dump_stack+0x4d/0x6a

Aug 21 18:45:28 area52 kernel:  nmi_cpu_backtrace+0x9b/0xa0

Aug 21 18:45:28 area52 kernel:  ? irq_force_complete_move+0xf0/0xf0

Aug 21 18:45:28 area52 kernel:  nmi_trigger_cpumask_backtrace+0x8f/0xc0

Aug 21 18:45:28 area52 kernel:  arch_trigger_cpumask_backtrace+0x14/0x20

Aug 21 18:45:28 area52 kernel:  rcu_dump_cpu_stacks+0x8f/0xca

Aug 21 18:45:28 area52 kernel:  rcu_check_callbacks+0x701/0x850

Aug 21 18:45:28 area52 kernel:  ? tick_sched_handle.isra.17+0x30/0x30

Aug 21 18:45:28 area52 kernel:  update_process_times+0x2a/0x50

Aug 21 18:45:28 area52 kernel:  tick_sched_handle.isra.17+0x2d/0x30

Aug 21 18:45:28 area52 kernel:  tick_sched_timer+0x38/0x70

Aug 21 18:45:28 area52 kernel:  __hrtimer_run_queues+0xde/0x210

Aug 21 18:45:28 area52 kernel:  hrtimer_interrupt+0xa3/0x190

Aug 21 18:45:28 area52 kernel:  local_apic_timer_interrupt+0x33/0x60

Aug 21 18:45:28 area52 kernel:  smp_apic_timer_interrupt+0x33/0x50

Aug 21 18:45:28 area52 kernel:  apic_timer_interrupt+0x86/0x90

Aug 21 18:45:28 area52 kernel: RIP: 0010:md_check_recovery+0x5b/0x460

Aug 21 18:45:28 area52 kernel: RSP: 0018:ffffc9000229bda8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10

Aug 21 18:45:28 area52 kernel: RAX: 0000000000000000 RBX: ffff880220eb8800 RCX: ffff88022638b500

Aug 21 18:45:28 area52 kernel: RDX: ffffc9000229be40 RSI: 0000000000000000 RDI: ffff880220eb8800

Aug 21 18:45:28 area52 kernel: RBP: ffffc9000229bdc0 R08: 0000000000000000 R09: 0000000000001b5d

Aug 21 18:45:28 area52 kernel: R10: ffffc9000229beb0 R11: 0000000000000000 R12: ffff880220eb8800

Aug 21 18:45:28 area52 kernel: R13: ffff88022638b528 R14: ffff880220e95480 R15: 0000000000000000

Aug 21 18:45:28 area52 kernel:  </IRQ>

Aug 21 18:45:28 area52 kernel:  raid1d+0x4c/0x7f0

Aug 21 18:45:28 area52 kernel:  md_thread+0x10d/0x140

Aug 21 18:45:28 area52 kernel:  ? md_thread+0x10d/0x140

Aug 21 18:45:28 area52 kernel:  ? wake_up_bit+0x30/0x30

Aug 21 18:45:28 area52 kernel:  kthread+0x104/0x140

Aug 21 18:45:28 area52 kernel:  ? md_register_thread+0xe0/0xe0

Aug 21 18:45:28 area52 kernel:  ? kthread_create_on_node+0x40/0x40

Aug 21 18:45:28 area52 kernel:  ret_from_fork+0x22/0x30

```

or

```

Aug 21 19:49:11 area52 kernel: INFO: rcu_sched self-detected stall on CPU

Aug 21 19:49:11 area52 kernel: \x090-...: (2099 ticks this GP) idle=642/140000000000001/0 softirq=1024655/1024655 fqs=1049

Aug 21 19:49:11 area52 kernel: \x09 (t=2100 jiffies g=912276 c=912275 q=121)

Aug 21 19:49:11 area52 kernel: NMI backtrace for cpu 0

Aug 21 19:49:11 area52 kernel: CPU: 0 PID: 2489 Comm: md124_raid1 Not tainted 4.12.5-gentoo #1

Aug 21 19:49:11 area52 kernel: Hardware name: ASUSTeK COMPUTER INC. P9D-X Series/P9D-X Series, BIOS 0704 03/28/2014

Aug 21 19:49:11 area52 kernel: Call Trace:

Aug 21 19:49:11 area52 kernel:  <IRQ>

Aug 21 19:49:11 area52 kernel:  dump_stack+0x4d/0x6a

Aug 21 19:49:11 area52 kernel:  nmi_cpu_backtrace+0x9b/0xa0

Aug 21 19:49:11 area52 kernel:  ? irq_force_complete_move+0xf0/0xf0

Aug 21 19:49:11 area52 kernel:  nmi_trigger_cpumask_backtrace+0x8f/0xc0

Aug 21 19:49:11 area52 kernel:  arch_trigger_cpumask_backtrace+0x14/0x20

Aug 21 19:49:11 area52 kernel:  rcu_dump_cpu_stacks+0x8f/0xca

Aug 21 19:49:11 area52 kernel:  rcu_check_callbacks+0x701/0x850

Aug 21 19:49:11 area52 kernel:  ? tick_sched_handle.isra.17+0x30/0x30

Aug 21 19:49:11 area52 kernel:  update_process_times+0x2a/0x50

Aug 21 19:49:11 area52 kernel:  tick_sched_handle.isra.17+0x2d/0x30

Aug 21 19:49:11 area52 kernel:  tick_sched_timer+0x38/0x70

Aug 21 19:49:11 area52 kernel:  __hrtimer_run_queues+0xde/0x210

Aug 21 19:49:11 area52 kernel:  hrtimer_interrupt+0xa3/0x190

Aug 21 19:49:11 area52 kernel:  local_apic_timer_interrupt+0x33/0x60

Aug 21 19:49:11 area52 kernel:  smp_apic_timer_interrupt+0x33/0x50

Aug 21 19:49:11 area52 kernel:  apic_timer_interrupt+0x86/0x90

Aug 21 19:49:11 area52 kernel: RIP: 0010:_raw_spin_lock_irqsave+0x6/0x30

Aug 21 19:49:11 area52 kernel: RSP: 0018:ffffc900022e3db0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10

Aug 21 19:49:11 area52 kernel: RAX: 0000000000000000 RBX: ffff88022615c414 RCX: 0000000000000000

Aug 21 19:49:11 area52 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88022615c414

Aug 21 19:49:11 area52 kernel: RBP: ffffc900022e3dc0 R08: 0000000000000000 R09: 0000000000000d9b

Aug 21 19:49:11 area52 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff880220f26800

Aug 21 19:49:11 area52 kernel: R13: ffff88022615c428 R14: ffff8802263ca780 R15: 0000000000000000

Aug 21 19:49:11 area52 kernel:  </IRQ>

Aug 21 19:49:11 area52 kernel:  raid1d+0xa1/0x7f0

Aug 21 19:49:11 area52 kernel:  md_thread+0x10d/0x140

Aug 21 19:49:11 area52 kernel:  ? md_thread+0x10d/0x140

Aug 21 19:49:11 area52 kernel:  ? wake_up_bit+0x30/0x30

Aug 21 19:49:11 area52 kernel:  kthread+0x104/0x140

Aug 21 19:49:11 area52 kernel:  ? md_register_thread+0xe0/0xe0

Aug 21 19:49:11 area52 kernel:  ? kthread_create_on_node+0x40/0x40

Aug 21 19:49:11 area52 kernel:  ret_from_fork+0x22/0x30

```

or this

```

Aug 21 20:58:30 area52 kernel: INFO: rcu_sched self-detected stall on CPU

Aug 21 20:58:30 area52 kernel: \x090-...: (2099 ticks this GP) idle=f1a/140000000000001/0 softirq=1200414/1200414 fqs=1049

Aug 21 20:58:30 area52 kernel: \x09 (t=2100 jiffies g=1041137 c=1041136 q=198)

Aug 21 20:58:30 area52 kernel: NMI backtrace for cpu 0

Aug 21 20:58:30 area52 kernel: CPU: 0 PID: 2489 Comm: md124_raid1 Tainted: G        W       4.12.5-gentoo #1

Aug 21 20:58:30 area52 kernel: Hardware name: ASUSTeK COMPUTER INC. P9D-X Series/P9D-X Series, BIOS 0704 03/28/2014

Aug 21 20:58:30 area52 kernel: Call Trace:

Aug 21 20:58:30 area52 kernel:  <IRQ>

Aug 21 20:58:30 area52 kernel:  dump_stack+0x4d/0x6a

Aug 21 20:58:30 area52 kernel:  nmi_cpu_backtrace+0x9b/0xa0

Aug 21 20:58:40 area52 kernel:  ? irq_force_complete_move+0xf0/0xf0

Aug 21 20:58:40 area52 kernel:  nmi_trigger_cpumask_backtrace+0x8f/0xc0

Aug 21 20:58:40 area52 kernel:  arch_trigger_cpumask_backtrace+0x14/0x20

Aug 21 20:58:40 area52 kernel:  rcu_dump_cpu_stacks+0x8f/0xca

Aug 21 20:58:40 area52 kernel:  rcu_check_callbacks+0x701/0x850

Aug 21 20:58:40 area52 kernel:  ? tick_sched_handle.isra.17+0x30/0x30

Aug 21 20:58:40 area52 kernel:  update_process_times+0x2a/0x50

Aug 21 20:58:40 area52 kernel:  tick_sched_handle.isra.17+0x2d/0x30

Aug 21 20:58:40 area52 kernel:  tick_sched_timer+0x38/0x70

Aug 21 20:58:40 area52 kernel:  __hrtimer_run_queues+0xde/0x210

Aug 21 20:58:40 area52 kernel:  hrtimer_interrupt+0xa3/0x190

Aug 21 20:58:40 area52 kernel:  local_apic_timer_interrupt+0x33/0x60

Aug 21 20:58:40 area52 kernel:  smp_apic_timer_interrupt+0x33/0x50

Aug 21 20:58:40 area52 kernel:  apic_timer_interrupt+0x86/0x90

Aug 21 20:58:40 area52 kernel: RIP: 0010:raid1d+0x47/0x7f0

Aug 21 20:58:40 area52 kernel: RSP: 0018:ffffc900022e3dd0 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff10

Aug 21 20:58:40 area52 kernel: RAX: ffff88022615c418 RBX: ffff88022615c400 RCX: ffff88022615c400

Aug 21 20:58:40 area52 kernel: RDX: ffffc900022e3e40 RSI: 0000000000000000 RDI: ffff880220f26800

Aug 21 20:58:40 area52 kernel: RBP: ffffc900022e3e90 R08: 0000000000000000 R09: 00000000000010e3

Aug 21 20:58:40 area52 kernel: R10: ffffc900022e3eb0 R11: 0000000000000000 R12: ffff880220f26800

Aug 21 20:58:40 area52 kernel: R13: ffff88022615c428 R14: ffff8802263ca780 R15: 0000000000000000

Aug 21 20:58:40 area52 kernel:  </IRQ>

Aug 21 20:58:40 area52 kernel:  md_thread+0x10d/0x140

Aug 21 20:58:40 area52 kernel:  ? md_thread+0x10d/0x140

Aug 21 20:58:40 area52 kernel:  ? wake_up_bit+0x30/0x30

Aug 21 20:58:40 area52 kernel:  kthread+0x104/0x140

Aug 21 20:58:40 area52 kernel:  ? md_register_thread+0xe0/0xe0

Aug 21 20:58:40 area52 kernel:  ? kthread_create_on_node+0x40/0x40

Aug 21 20:58:40 area52 kernel:  ret_from_fork+0x22/0x30

```

as far as i can tell, it's related to my raid config: it happens when my raid disks (two sata drives) spin up after being idle. my previously used kernel 4.4.6 showed no such errors, so i will also check the newly introduced features...

perhaps someone has an idea about the issue?

greets

snIP3r

----------

## snIP3r

looks like this is about our issue:

https://lkml.org/lkml/2017/8/6/197

----------

## araxon

Same here. Under high disk load, the server throws a similar message and then stops all disk I/O. It cannot even write an error log, so it took me days to track down. I managed to log the errors remotely once I noticed that networking stays alive a bit longer. There is no RAID5/6 on the server, only RAID1, but the error seems md_raid related.

I am able to reproduce the crash pretty regularly on this hardware, so if you have anything non-destructive that can be tried, I may be able to test it.

```
Sep  5 18:55:26 10.0.0.149 kernel: INFO: rcu_sched self-detected stall on CPU

Sep  5 18:55:26 10.0.0.149 kernel: \x090-...: (2099 ticks this GP) idle=c7e/140000000000001/0 softirq=1075486/1075486 fqs=1049

Sep  5 18:55:26 10.0.0.149 kernel: \x09 (t=2100 jiffies g=588759 c=588758 q=3532)

Sep  5 18:55:26 10.0.0.149 kernel: NMI backtrace for cpu 0

Sep  5 18:55:26 10.0.0.149 kernel: CPU: 0 PID: 124 Comm: md3_raid1 Tainted: G        W       4.12.5-gentoo #1

Sep  5 18:55:26 10.0.0.149 kernel: Hardware name: HPE ML10Gen9/ML10Gen9, BIOS 1.003 07/27/2016

Sep  5 18:55:26 10.0.0.149 kernel: Call Trace:

Sep  5 18:55:26 10.0.0.149 kernel:  <IRQ>

Sep  5 18:55:26 10.0.0.149 kernel:  dump_stack+0x4d/0x6a

Sep  5 18:55:26 10.0.0.149 kernel:  nmi_cpu_backtrace+0x95/0xa0

Sep  5 18:55:26 10.0.0.149 kernel:  ? irq_force_complete_move+0xf0/0xf0

Sep  5 18:55:26 10.0.0.149 kernel:  nmi_trigger_cpumask_backtrace+0x88/0xd0

Sep  5 18:55:26 10.0.0.149 kernel:  arch_trigger_cpumask_backtrace+0x14/0x20

Sep  5 18:55:26 10.0.0.149 kernel:  rcu_dump_cpu_stacks+0x93/0xce

Sep  5 18:55:26 10.0.0.149 kernel:  rcu_check_callbacks+0x767/0x8b0

Sep  5 18:55:26 10.0.0.149 kernel:  ? acct_account_cputime+0x17/0x20

Sep  5 18:55:26 10.0.0.149 kernel:  ? tick_sched_do_timer+0x40/0x40

Sep  5 18:55:26 10.0.0.149 kernel:  update_process_times+0x2a/0x50

Sep  5 18:55:26 10.0.0.149 kernel:  tick_sched_handle.isra.15+0x2d/0x40

Sep  5 18:55:26 10.0.0.149 kernel:  tick_sched_timer+0x38/0x70

Sep  5 18:55:26 10.0.0.149 kernel:  __hrtimer_run_queues+0xda/0x210

Sep  5 18:55:26 10.0.0.149 kernel:  hrtimer_interrupt+0xac/0x1f0

Sep  5 18:55:26 10.0.0.149 kernel:  local_apic_timer_interrupt+0x33/0x50

Sep  5 18:55:26 10.0.0.149 kernel:  smp_apic_timer_interrupt+0x33/0x50

Sep  5 18:55:26 10.0.0.149 kernel:  apic_timer_interrupt+0x86/0x90

Sep  5 18:55:26 10.0.0.149 kernel: RIP: 0010:_raw_spin_lock+0xb/0x20

Sep  5 18:55:26 10.0.0.149 kernel: RSP: 0018:ffffc9000065fda0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10

Sep  5 18:55:26 10.0.0.149 kernel: RAX: 0000000000000000 RBX: ffff8802693ee800 RCX: 0000000000000001

Sep  5 18:55:26 10.0.0.149 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8802693eea80

Sep  5 18:55:26 10.0.0.149 kernel: RBP: ffffc9000065fdc0 R08: 0000000000000000 R09: 0000000000000000

Sep  5 18:55:26 10.0.0.149 kernel: R10: ffffc9000065feb0 R11: 0000000000000000 R12: 0000000000000000

Sep  5 18:55:26 10.0.0.149 kernel: R13: ffff8802681c0328 R14: ffff880268fea880 R15: 0000000000000000

Sep  5 18:55:26 10.0.0.149 kernel:  </IRQ>

Sep  5 18:55:26 10.0.0.149 kernel:  ? md_check_recovery+0x2b7/0x460

Sep  5 18:55:26 10.0.0.149 kernel:  raid1d+0x4c/0x8e0

Sep  5 18:55:26 10.0.0.149 kernel:  md_thread+0x115/0x140

Sep  5 18:55:26 10.0.0.149 kernel:  ? md_thread+0x115/0x140

Sep  5 18:55:26 10.0.0.149 kernel:  ? wake_atomic_t_function+0x60/0x60

Sep  5 18:55:26 10.0.0.149 kernel:  kthread+0x103/0x140

Sep  5 18:55:26 10.0.0.149 kernel:  ? find_pers+0x70/0x70

Sep  5 18:55:26 10.0.0.149 kernel:  ? kthread_create_on_node+0x40/0x40

Sep  5 18:55:26 10.0.0.149 kernel:  ret_from_fork+0x22/0x30
```

----------

## snIP3r

yes, it's md related. i switched back to my previously used kernel - no such errors. so for now i am waiting for the next stable kernel...

----------

## ecko

My bisection led to this commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8d5e72dfdf0fa29a21143fd72746c6f43295ce9f ("This update includes the usual round of major driver updates").

I did some limited testing with 4.13-rc7, and so far the problem has not shown up. I'll test for longer with 4.13 before declaring it solved.

----------

## ecko

After several days of tests, the problem does not happen with kernel 4.13.

----------

## araxon

 *ecko wrote:*   

> After several days of tests, the problem does not happen with kernel 4.13.

 

I've been trying to trigger the error all day on kernel 4.12.12, and so far it seems fixed there as well.

----------

## masc

 *araxon wrote:*   

>  *ecko wrote:*   After several days of tests, the problem does not happen with kernel 4.13. 
> 
> I've been trying to trigger the error all day on kernel 4.12.12, and so far it seems fixed there as well.

 

it seems to be fixed in `4.12.11` as well.

----------

## peppev

 *masc wrote:*   

>  *araxon wrote:*    *ecko wrote:*   After several days of tests, the problem does not happen with kernel 4.13. 
> 
> I've been trying to trigger the error all day on kernel 4.12.12, and so far it seems fixed there as well.
> 
> it seems to be fixed in `4.12.11` as well.

 

One of my systems is under the stable 4.12.15  and this morning dmesg reports:

```

[154509.424066] INFO: rcu_sched self-detected stall on CPU

[154509.424075]         0-...: (59999 ticks this GP) idle=32a/140000000000001/0 softirq=7363992/7363992 fqs=14633 

[154509.424076]          (t=60000 jiffies g=3560363 c=3560362 q=845)

[154509.424081] NMI backtrace for cpu 0

[154509.424087] CPU: 0 PID: 3095 Comm: md3_raid1 Tainted: P           O    4.12.5-gentoo #1

[154509.424088] Hardware name:                  /D925XECV2                      , BIOS CV92510A.86A.0504.2006.1128.1903 11/28/2006

[154509.424090] Call Trace:

[154509.424093]  <IRQ>

[154509.424101]  dump_stack+0x4d/0x63

[154509.424105]  nmi_cpu_backtrace+0x76/0x85

[154509.424109]  ? irq_force_complete_move+0xd5/0xd5

[154509.424112]  nmi_trigger_cpumask_backtrace+0x51/0xb2

[154509.424115]  arch_trigger_cpumask_backtrace+0x14/0x16

[154509.424119]  rcu_dump_cpu_stacks+0x89/0xb6

[154509.424123]  rcu_check_callbacks+0x232/0x5eb

[154509.424127]  ? raise_softirq_irqoff+0x9/0x1e

[154509.424130]  update_process_times+0x2a/0x4f

[154509.424134]  tick_sched_handle+0x2f/0x3b

[154509.424136]  tick_sched_timer+0x34/0x5a

[154509.424139]  __hrtimer_run_queues+0xba/0x182

[154509.424142]  hrtimer_interrupt+0x67/0x105

[154509.424145]  local_apic_timer_interrupt+0x46/0x49

[154509.424148]  smp_apic_timer_interrupt+0x24/0x34

[154509.424152]  apic_timer_interrupt+0x86/0x90

[154509.424157] RIP: 0010:do_raw_spin_lock+0xd/0x1c

[154509.424159] RSP: 0018:ffffc90000d03da0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10

[154509.424162] RAX: 0000000000000000 RBX: ffff880092059800 RCX: 0000000000000000

[154509.424164] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880092059a80

[154509.424166] RBP: ffffc90000d03da8 R08: ffff880037810000 R09: ffff880037810000

[154509.424168] R10: ffff8800920599e8 R11: 0000000000000372 R12: ffff88009167e300

[154509.424170] R13: ffff8800947f93d0 R14: ffff880037810000 R15: 0000000000000000

[154509.424172]  </IRQ>

[154509.424176]  ? _raw_spin_lock+0x9/0xb

[154509.424180]  md_check_recovery+0x21c/0x3cc

[154509.424184]  raid1d+0x3b/0x6e5

[154509.424187]  md_thread+0x110/0x14a

[154509.424190]  ? md_thread+0x110/0x14a

[154509.424193]  ? wake_up_atomic_t+0x27/0x27

[154509.424195]  ? md_do_sync+0xca0/0xca0

[154509.424199]  kthread+0xf7/0xfc

[154509.424202]  ? init_completion+0x23/0x23

[154509.424204]  ret_from_fork+0x22/0x30

```

This never happened with the previous stable gentoo-sources kernels (the last one was 4.9.34).

Does it look related to the problem discussed in this thread?

----------

## masc

 *peppev wrote:*   

> 
> 
> Does it look related to the problem discussed in this thread?

 

it certainly looks like it.

----------

## peppev

 *masc wrote:*   

>  *peppev wrote:*   
> 
> Does it look related to the problem discussed in this thread? 
> 
> it certainly looks like it.

 

Well, I guess we have a problem with 4.12.5?

In less than a week from when I emerged it, I found 3 blocking bugs.

1) the tape changer sg device driver doesn't work, needs a patch;

2) it has a very nasty bug in the netfilter conntrack code which leads to random panics;

3) it seems, from this thread, to have a problem with mdadm arrays.

Mmm ... stable?

----------

## araxon

 *peppev wrote:*   

> 
> 
> One of my systems is under the stable 4.12.15  and this morning dmesg reports:
> 
> ...

 

That looks like 4.12.5, not 4.12.15: the `Comm:` line of your trace reads `4.12.5-gentoo`. In other words, it is the exact same bug we are discussing here. Try upgrading to a later kernel, as suggested in this thread.
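
A quick way to settle which kernel produced a trace is to read the release string from its `CPU: ... Comm: ...` header line rather than rely on memory. A minimal sketch, assuming the header keeps the layout seen in the traces in this thread (release string followed by a `#<build>` counter); the function name is made up for illustration:

```python
import re

def kernel_release(trace_header: str) -> str:
    """Extract the kernel release from the 'CPU: ... Comm: ...' line of a backtrace.

    Assumes the release is the x.y.z[-suffix] token directly before '#<build>'.
    """
    match = re.search(r"(\d+\.\d+\.\d+[\w.-]*) #\d+", trace_header)
    if match is None:
        raise ValueError("no kernel release found in this line")
    return match.group(1)

header = ("[154509.424087] CPU: 0 PID: 3095 Comm: md3_raid1 "
          "Tainted: P           O    4.12.5-gentoo #1")
print(kernel_release(header))
```

Requiring the `#<build>` suffix keeps the pattern from picking up other dotted numbers on the line, such as the IP address that prefixes remotely logged messages.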

----------

## araxon

 *peppev wrote:*   

> Well, I guess we have a problem with 4.12.5?
> 
> In less than a week from when I emerged it, I found 3 blocking bugs.
> 
> 1) the tape changer sg device driver doesn't work, needs a patch;
> ...

 

4.12.5 seems to have been removed from Gentoo Portage already. There is not much more to be done here.

----------

## peppev

 *araxon wrote:*   

>  *peppev wrote:*   
> 
> One of my systems is under the stable 4.12.15  and this morning dmesg reports:
> 
> ...

 

Apologies for the obvious "typo"; of course it is 4.12.5.

I installed 4.12.12; the mtx and conntrack bugs have been corrected, and the patches that had already been available for months are present in the new kernel.

Let's see if the mdadm problem is solved; I have no idea how to check for this problem in the kernel source.

I can only say I was probably really "unlucky" with my monthly emerge schedule, getting "trapped" with a kernel in such bad shape.

Though, some check before declaring a kernel "stable" would be appreciated.

It is really disappointing to see systems that had been solid as a rock for years under Gentoo, with terabytes of data stored on their disks, panic like crazy because of a stupid "typo" (as the original developer called it in this thread: https://www.spinics.net/lists/kernel/msg2558062.html).

I hope to be luckier in the future.

----------

## araxon

 *peppev wrote:*   

> 
> 
> Let's see if the mdadm problem is solved; I have no idea how to check for this problem in the kernel source.
> 
> 

 

I was able to crash my machine on kernel 4.12.5 (with the mdadm bug) in a matter of hours. I have been running 4.12.12 24/7 for the past 10 days, and it no longer seems to have this particular bug.

 *peppev wrote:*   

> 
> 
> Though, some check before declaring a kernel "stable" would be appreciated.
> 
> It is really disappointing to see systems that had been solid as a rock for years under Gentoo, with terabytes of data stored on their disks, panic like crazy because of a stupid "typo" (as the original developer called it in this thread: https://www.spinics.net/lists/kernel/msg2558062.html).
> ...

 

Yes, that is embarrassing, and I too would be happier if it did not happen in the future. But we are getting this whole Gentoo miracle thing for free, so I'm grateful either way. Excellent value for the (zero) money spent.

----------

## peppev

 *araxon wrote:*   

>  *peppev wrote:*   
> 
> Let's see if the mdadm problem is solved; I have no idea how to check for this problem in the kernel source.
> 
>  
> ...

 

I'm as grateful to Gentoo as you are, not only for the metadistro I can use in an "uncountable" number of ways, but also for keeping me as close to upstream as is possible these days, especially at my age ;-(

And I understand how "impossible" it may be to check all the "branches" a kernel may walk through while running its code.

But this particular version of the kernel seems to have been distributed as "stable" in a great hurry, missing a bunch of patches that had already been available for months.

It never happened before in the seven years I have used Gentoo on my "production" systems.

Just wondering why.

About the mdadm bug, I'm still in the "check state".

I have a dozen systems still running 4.12.5 with mdadm arrays that show no trace of the problem.

Just one of my systems printed the "stall" warning in its dmesg, without any other apparent problem.

I bet this bug, if it is a "single bug", is not an "easy one" and may not be "over", at least until we find a kernel patch that shows the reason for the stall message.

----------

## Hu

This is the typical problem caused by different definitions of "stable."  Upstream stable kernels start as the most recent Linus release (excluding release-candidates and snapshots), then add patches tagged as fixes (usually, but not always, tagged as such by the patch's author).  Upstream typically performs basic build tests, but relies on the authors of the individual fixes to test functionality.  There is typically some overlap where a previous stable kernel will receive additional fixes after a newer major series is available, but the same caveat applies.  Users and, to some extent, distributions want to treat "stable" as implying a lack of serious new bugs.  In a general sense, the stable series kernels from upstream are more stable than the base Linus kernel from which they derive, since they only take fixes on top of that kernel rather than big new features.  However, each new Linus kernel features extensive changes relative to the prior Linus kernel, any of which could be bad if its respective author did not adequately test it.

----------

