# CMCI storm detected: switching to poll mode

## P.Kosunen

```
Sep  2 18:03:54 shuttle kernel: [708926.214851] CMCI storm detected: switching to poll mode
Sep  2 18:03:54 shuttle kernel: [710011.679941] INFO: rcu_sched self-detected stall on CPU
Sep  2 18:03:54 shuttle kernel: [710011.679948] ^I0-...: (2 GPs behind) idle=cc6/140000000000001/0 softirq=10536764/10536766 fqs=0
Sep  2 18:03:54 shuttle kernel: [710011.679949] ^I (t=1294978 jiffies g=5107773 c=5107772 q=62979)
Sep  2 18:03:54 shuttle kernel: [710011.679952] rcu_sched kthread starved for 1294978 jiffies! g5107773 c5107772 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
Sep  2 18:03:54 shuttle kernel: [710011.679954] rcu_sched       S15088     8      2 0x00000000
Sep  2 18:03:54 shuttle kernel: [710011.679959] Call Trace:
Sep  2 18:03:54 shuttle kernel: [710011.679968]  ? __schedule+0x1ef/0x430
Sep  2 18:03:54 shuttle kernel: [710011.679970]  ? schedule+0x2d/0x80
Sep  2 18:03:54 shuttle kernel: [710011.679971]  ? schedule_timeout+0xf3/0x170
Sep  2 18:03:54 shuttle kernel: [710011.679975]  ? mod_timer+0x180/0x180
Sep  2 18:03:54 shuttle kernel: [710011.679977]  ? rcu_accelerate_cbs+0x36/0x190
Sep  2 18:03:54 shuttle kernel: [710011.679978]  ? rcu_gp_kthread+0x489/0x7b0
Sep  2 18:03:54 shuttle kernel: [710011.679981]  ? prepare_to_swait_event+0x1a/0x40
Sep  2 18:03:54 shuttle kernel: [710011.679982]  ? rcu_gp_kthread+0x489/0x7b0
Sep  2 18:03:54 shuttle kernel: [710011.679984]  ? kthread+0xf2/0x130
Sep  2 18:03:54 shuttle kernel: [710011.679986]  ? synchronize_rcu_expedited+0x10/0x10
Sep  2 18:03:54 shuttle kernel: [710011.679987]  ? kthread_create_on_node+0x40/0x40
Sep  2 18:03:54 shuttle kernel: [710011.679989]  ? ret_from_fork+0x22/0x30
Sep  2 18:03:54 shuttle kernel: [710011.679993] NMI backtrace for cpu 0
Sep  2 18:03:54 shuttle kernel: [710011.679996] CPU: 0 PID: 2491 Comm: cw_process Not tainted 4.12.5-gentoo #2
Sep  2 18:03:54 shuttle kernel: [710011.679997] Hardware name: Shuttle Inc. DX30D/FDX30, BIOS 1.02 02/15/2017
Sep  2 18:03:54 shuttle kernel: [710011.679997] Call Trace:
Sep  2 18:03:54 shuttle kernel: [710011.679998]  <IRQ>
Sep  2 18:03:54 shuttle kernel: [710011.680002]  ? dump_stack+0x46/0x61
Sep  2 18:03:54 shuttle kernel: [710011.680004]  ? nmi_cpu_backtrace+0x8a/0x90
Sep  2 18:03:54 shuttle kernel: [710011.680006]  ? irq_force_complete_move+0xe0/0xe0
Sep  2 18:03:54 shuttle kernel: [710011.680008]  ? nmi_trigger_cpumask_backtrace+0x86/0xc0
Sep  2 18:03:54 shuttle kernel: [710011.680009]  ? rcu_dump_cpu_stacks+0x88/0xc1
Sep  2 18:03:54 shuttle kernel: [710011.680011]  ? rcu_check_callbacks+0x642/0x780
Sep  2 18:03:54 shuttle kernel: [710011.680013]  ? update_wall_time+0x474/0x720
Sep  2 18:03:54 shuttle kernel: [710011.680015]  ? update_process_times+0x23/0x50
Sep  2 18:03:54 shuttle kernel: [710011.680016]  ? tick_sched_timer+0x3d/0x130
Sep  2 18:03:54 shuttle kernel: [710011.680018]  ? __hrtimer_run_queues+0xb5/0x120
Sep  2 18:03:54 shuttle kernel: [710011.680019]  ? hrtimer_interrupt+0x9d/0x1e0
Sep  2 18:03:54 shuttle kernel: [710011.680022]  ? smp_trace_apic_timer_interrupt+0x59/0x90
Sep  2 18:03:54 shuttle kernel: [710011.680024]  ? apic_timer_interrupt+0x7f/0x90
Sep  2 18:03:54 shuttle kernel: [710011.680024]  </IRQ>
Sep  2 18:03:54 shuttle kernel: klogd 1.5.1, ---------- state change ----------
Sep  2 18:03:54 shuttle kernel: Loaded 57659 symbols from 13 modules.
Sep  2 18:03:54 shuttle kernel: [710011.682343] Hangcheck: hangcheck value past margin!
Sep  2 18:09:23 shuttle kernel: [710340.172989] CMCI storm subsided: switching to interrupt mode
```

I got this error on a new Shuttle XPC Slim DX30 with an Intel Celeron J3355 CPU and a Corsair 8GB memory kit (CMSO8GX3M2C1600C11). Is this caused by incompatible or faulty memory, or by something else? The clock was off by several hours, and the machine couldn't reboot cleanly the next morning.
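Since the wall clock was unreliable here, the bracketed kernel timestamps (seconds since boot) are probably a better guide to how long a storm lasted than the syslog date prefix. A minimal Python sketch for pulling storm intervals out of a log like the one above follows. The function name `storm_intervals` and the regular expression are illustrative, and the sketch assumes the two message strings appear exactly as in this log.

```python
import re

# Matches the bracketed kernel timestamp followed by one of the two
# CMCI storm messages seen in the log above.
STORM_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+CMCI storm (?P<event>detected|subsided)"
)


def storm_intervals(lines):
    """Return a list of (start, end, duration) tuples.

    All values are kernel timestamps in seconds since boot. A "subsided"
    line with no preceding "detected" line is ignored.
    """
    intervals = []
    start = None
    for line in lines:
        match = STORM_RE.search(line)
        if match is None:
            continue
        ts = float(match.group("ts"))
        if match.group("event") == "detected":
            start = ts
        elif start is not None:
            intervals.append((start, ts, ts - start))
            start = None
    return intervals


if __name__ == "__main__":
    import sys

    # Usage: python storm.py < /var/log/messages  (log path is an assumption)
    for start, end, duration in storm_intervals(sys.stdin):
        print(f"storm: {start:.3f} -> {end:.3f} ({duration:.1f} s)")
```

Comparing the duration computed this way against the difference between the syslog date prefixes gives a rough idea of how far the system clock drifted during the stall.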

----------

## eccerr0r

It could be bad memory or a bad CPU. CMCI (corrected machine check interrupt) storms usually point to a hardware problem, and you may well have to RMA the machine. You may want to try other memory configurations, or perhaps adjust the overclocking options, to see if it goes away.

There's also a possibility of bad firmware that needs to be addressed.  See if there's a firmware update.

A kernel bug is still a possibility, but unlikely if the same kernel works on other machines.

----------

## P.Kosunen

The BIOS is the latest version. An OS selection option in the UEFI/BIOS setup is set to Windows because it also controls UEFI vs. legacy BIOS switching.

I updated the system and the kernel to 4.13.0 and switched the clocksource to hpet; no issues since. It might be too early to tell, but let's hope it was the 4.12.5 kernel or some other software problem.
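To confirm which clocksource the kernel is actually using after such a change, the standard sysfs files `current_clocksource` and `available_clocksource` can be read. The helper below is only a sketch; the function name `read_clocksource` and the `base` parameter are illustrative, not something from this thread.

```python
from pathlib import Path

# Standard sysfs directory for the system clocksource on Linux.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources as (str, list of str)."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__":
    current, available = read_clocksource()
    print(f"current clocksource: {current}")
    print(f"available: {', '.join(available)}")
    if current != "hpet":
        print("note: hpet is not the active clocksource")
```

The `base` argument exists only so the function can be pointed at a test directory instead of the real sysfs path.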

Edit: Disabling intel_idle in the kernel seems to be a workaround for this problem. I still need to test different intel_idle.max_cstate levels...

Edit2: On a different machine with a Celeron J3455 and Void Linux, the CMCI storm does not happen with the "processor.max_cstate=1 intel_idle.max_cstate=0" kernel boot options.
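Boot options are easy to mistype or lose after a bootloader update, so it can help to verify that the running kernel really received them. The sketch below parses a kernel command line (as found in `/proc/cmdline`) and checks for the two options quoted above. The helper names `parse_cmdline` and `cstate_workaround_active` are illustrative, and the simple whitespace split assumes no quoted values containing spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' map to None. Quoted values containing spaces
    are not handled by this simple split.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def cstate_workaround_active(cmdline):
    """True if both C-state options from the workaround are present."""
    params = parse_cmdline(cmdline)
    return (
        params.get("processor.max_cstate") == "1"
        and params.get("intel_idle.max_cstate") == "0"
    )


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        running = f.read()
    if cstate_workaround_active(running):
        print("C-state workaround is active")
    else:
        print("C-state workaround is NOT active")
```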

CMCI storms usually happen when copying data from the local SSD to a NAS at >100MB/s (full gigabit network load). It might not be faulty hardware, because the same issue occurs on two different boxes.

----------

