# 4.12 to 4.14, kernel panic.  Solved.

## 1clue

Hi,

I did an update, got the new linux-4.14.8-gentoo-r1 sources and decided to upgrade from 4.12.12-gentoo.

I got a kernel panic early in the boot, and no logs are written. I would appreciate some help finding what I messed up.

I have an atom c2758 board using profile default/linux/amd64/17.1/no-multilib/hardened. This is the latest testing profile; it's a non-production system at the moment.

Clearly something changed from 4.12 to 4.14 and I don't know what it is.

My previous (working) config: https://paste.pound-python.org/show/bSYRIlJ8yJR9WTHj2Gjl/

My next (non-working) config: https://paste.pound-python.org/show/BNIb6Cza1u5WEtKFhqnP/

The difference between them: https://paste.pound-python.org/show/LlvYMXgHmAwU2XkjO9gm/

I recorded the console on startup, and while I can't paste the video I can type a few lines in:

```

# Does some IPMI detection

ACPI: Power Button [PWRF]

Bug: unable to handle kernel NULL pointer dereference at 0000000000000000064

IP: __kmalloc+0xce/0x1d0

PGD 0 P4D 0

Oops: 0000 [#1] SMP

Modules linked in:

CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.8-gentoo-r1-k1 #2

Hardware name: Supermicro A1SRM-LN7/LN5F/A1SRM-LN7F-2758, BIOS 1.0 09/17/2014

task: ffffa102ed8e0000 task.stack: ffffab6d40010000

RIP: 0010:__kmalloc+0xce/0x1d0

RSP: 0000:ffffab6d40013c90 EFLAGS: 00010202

RAX: 00000000000000 RBX: 000000000000064 RCX: 00000000000001af

RDX: 000000000001ae RSI: 000000000000000 RDI: 000000000001d660

RBP: ffffab6d40013cc0 R08: ffffa102ecc98c00 R09: ffffffffa27a8ec5

R10: ffffd721d1af5740 R11: ffffa102ed4c935f R12: 00000000014000c0

R13: 00000000000140 R14: fffa102ed803080 R15: ffffa102ed803080

FS:  00000000000000(0000) GS:ffffa102ffc00000(0000) knlGS: 000000000000000

CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

CR2: 00000000000064 CR3: 00000001b020a000 CR4: 00000000001006f0

Call Trace:

 acpi_processor_get_throttling_info+0x445/0x630

 __acpi_processor_start+0x83/0x1d0

 acpi_processor_start+0x4d/0x60

 driver_probe_device+0x25a/0x2f0

 __driver_attach+0xaf/0xc0

 ? driver_probe_device+0x2f0/0x2f0

 bus_for_each_dev+0x6d/0xa0

 driver_attach+0x2e/0x30

 bus_add_driver+0x12f/0x230

 ? do_early_param+0xa2/0xa2

 driver_register+0x70/0xf0

 ? acpi_video_init+0x9a/0x9a 

 acpi_processor_driver_init+0x34/0xa8

 ? acpi_video_init+0x9a/0x9a

 do_one_initcall+0x5e/0x1a0

 ? do_early_param+0xa2/0xa2

 kernel_init_freeable+0x179/0x1fc

 ? rest_init+0xc0/0xc0

 kernel_init+0x1e/0x110

 ret_from_fork+0x25/0x30

Code: e7 00 00 00 49 63 46 20 49 8b 3e 48 8d 4a 01 49 8b 1c 00 49 8d 00 65 48 0f c7 0f 0f 94 c0 84 c0 74 c6 48 85

db 74 0b 49 63 46 20 <48> 8b 04 03 0f 18 08 41 f7 c4 00 80 00 00 49 8d 18 0f 85 d1 00

RIP: __kmalloc+0x1d0 RSP: fffab6d40013c90

CR2: 000000000000064

---[ end trace 223394f177cfe3e2 ]---

Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009

Kernel Offset: 0x21000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009

sched: Unexpected reschedule of offline CPU#4!

-----------------[ cut here ]-------------------

WARNING: CPU: 0 PID: 1 at /usr/src/linux-4.14.8.gentoo-r1/arch/x86/kernel/smp.c: 128 native_smp_send_reschedule+0x47/0x50

Modules linked in:

CPU: 0 PID: 1 Comm: swapper/0 Tainted: G       D             4.14.8-gentoo-r1-k1 #2

blah blah blah.

...
```

There is more in that recording.

That exitcode seems to reference missing modules, but I built the modules and installed them before the kernel.  My command was:

```
mount /boot; make && make modules && make modules_install && make install
```

Thanks.

Last edited by 1clue on Fri Feb 09, 2018 6:27 pm; edited 1 time in total
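A note on the exitcode in the panic line: as far as I know it is init's exit code in the usual wait-status layout, where the low 7 bits hold the signal that killed the process and the next byte holds a normal exit status. A minimal sketch of that decoding, assuming this layout applies here (decode_exitcode is a made-up helper name, not an existing tool):

```shell
# Sketch: split a wait-status style value into its two parts.
# decode_exitcode is a hypothetical helper, not a kernel or coreutils tool.
decode_exitcode() {
    code=$(( $1 ))                     # accepts decimal or 0x-prefixed hex
    sig=$(( code & 0x7f ))             # low 7 bits: terminating signal, 0 if none
    status=$(( (code >> 8) & 0xff ))   # next byte: exit status of a normal exit
    echo "signal=$sig status=$status"
}

decode_exitcode 0x00000009   # prints: signal=9 status=0
```

Read that way, 0x00000009 would mean PID 1 was killed by signal 9 as a consequence of the oops above, rather than pointing at missing modules.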

----------

## fedeliallalinea

Could it be related to this?

Reference:

https://forums.gentoo.org/viewtopic-t-1074646.html

----------

## 1clue

That's odd. I don't have any keywords and this kernel just came down this morning. I did an emerge-webrsync too.

I wonder why it came down?

Thanks.

----------

## 1clue

And more importantly, why would I want to go all the way back to 4.9?

----------

## NeddySeagoon

1clue,

4.12 no longer gets security patches, so it's masked.

4.14 and gentoo gcc-6.4 don't play nicely (Linus is not amused).

That was masked while the investigation was underway.

In turn, that makes 4.9 the current stable.

----------

## 1clue

My 4.12 was compiled before the switch. I think I'll stick it out for a while rather than go back to the dark ages.

Thanks.

----------

## Fluxie

 *NeddySeagoon wrote:*   

> 1clue,
> 
> 4.14 and gentoo gcc-6.4 don't play nicely (Linus is not amused).  
> 
> That was masked while the investigation was underway. 
> ...

 

Could you point me where you found this bit of information?

I'm curious because I'm currently running "4.14.10-gentoo-r1" compiled with GCC-6.4.0. This combination does seem stable to me but I would like to be sure. Also I would rather not switch to 4.9 because I have a new AMD processor which doesn't play nicely with 4.9, afaik...

Exact version: "Linux version 4.14.10-gentoo-r1 (root@<masked>i) (gcc version 6.4.0 (Gentoo 6.4.0 p1.1)) #1 SMP Mon Jan 1 12:03:20 EET 2018"

Thanks:)

----------

## asturm

The issues were caused by hardened patchset. So if your kernel image works, just stick with it.
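One rough way to tell which case a running kernel falls into is to look at the compiler banner recorded in /proc/version. The sketch below assumes that a hardened Gentoo gcc identifies itself with the words "Gentoo Hardened" in that banner; that is an assumption worth checking on a real hardened box, and built_with_hardened_gcc is a made-up helper name.

```shell
# Sketch: read a /proc/version style line on stdin and succeed if the
# compiler banner mentions a hardened Gentoo toolchain.
# Assumption: hardened gcc builds carry the string "Gentoo Hardened".
built_with_hardened_gcc() {
    grep -qF -e 'Gentoo Hardened'
}

# Usage on a running system:
#   built_with_hardened_gcc < /proc/version && echo "kernel built with hardened gcc"
```

The version string quoted above, "(gcc version 6.4.0 (Gentoo 6.4.0 p1.1))", would not match under that assumption, which fits a non-hardened toolchain.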

----------

## NeddySeagoon

Fluxie,

It's on the LKML.

----------

## 1clue

 *NeddySeagoon wrote:*   

> Fluxie,
> 
> It's on the LKML.

 

So evidently it's a gentoo-specific issue in the compiler? When can we expect a new compiler? With this spectre/meltdown BS I would like to get on a newer kernel.

And as far as that goes, has anyone found what kernel options we should disable or enable in light of these bugs? I know it's not a complete fix, but I don't want to shoot myself in the foot here. All I can find are some general-public 'we're working on it' crap. I have a shitload of boxes to fix. Pardon my French.

----------

## asturm

Do you use hardened profile?

----------

## 1clue

default/linux/amd64/17.1/no-multilib/hardened

----------

## toralf

4.14.11 contains the -fno-stack-check quirk, so a stable hardened Gentoo Linux compiles and boots the kernel fine (tested on my hardened server and my hardened client).
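To see whether a given source tree already carries the quirk, one option is to search its top-level Makefile for the flag. This is only a sketch: it assumes the quirk appears there as a literal -fno-stack-check string, and has_no_stack_check is a made-up helper name.

```shell
# Sketch: succeed if the given kernel Makefile mentions -fno-stack-check.
# Assumption: the quirk lives in the top-level Makefile as a literal string.
has_no_stack_check() {
    grep -qF -e '-fno-stack-check' "$1"
}

# Usage (the path is an example, adjust to your tree):
#   has_no_stack_check /usr/src/linux/Makefile && echo "build system sets the flag"
```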

----------

## ct85711

From what I recall, towards the end of the message thread on this issue, it sounded like they intend to strip stack-check outright and/or force no-stack-check in the compiler flags so that this issue won't be a factor.  Though who knows which versions will get this change.

----------

## Hu

 *1clue wrote:*   

> So evidently it's a gentoo-specific issue in the compiler?

 No.  It's a negative interaction among:

- Upstream gcc implements -fstack-check in a way that the kernel developers think is ugly and questionable (but runs correctly for user code).
- -fstack-check will, for certain kernel functions, generate code that breaks the kernel.  For other functions, it's suboptimal and possibly wrong in a subtle way, but is not immediately system-breaking.  Unfortunately, for the functions that it breaks outright, almost everybody needs those functions to work, so you cannot avoid the problem by being lucky or disabling optional kernel features.
- Hardened Gentoo (not Gentoo in general, but only the hardened profiles) default-enable this feature written by upstream.

 *1clue wrote:*   

> When can we expect a new compiler?

 You don't need a new compiler.  You need not to generate user-mode-specific stack probes when compiling the kernel.  This can be done by not using a hardened gcc or by passing -fno-stack-check.  Per toralf's post two up from mine (and three down from yours), the latest 4.14.x will do this for you.  To quote Greg KH, "all users must upgrade."  :Wink: 

 *1clue wrote:*   

>  With this spectre/meltdown BS I would like to get on a newer kernel.
> 
> And as far as that goes, has anyone found what kernel options we should disable or enable in light of these bugs? I know it's not a complete fix, but I don't want to shoot myself in the foot here.

 As for Meltdown and Spectre, you may or may not be in a position to need the KPTI patches.  If you have a large number of machines you manage, then you probably have at least some where you allow untrusted users to run unprivileged code.  Those machines may need KPTI, depending on exactly how little you trust the users.  For a complete fix, you could switch to using unaffected CPUs.  Per the reporting I've read, pre-1995 CPUs are unaffected, as are old in-order-only Intel Atom chips.  :Wink: 

----------

## 1clue

@Hu,

Thanks much, that gives me a path.  So -fno-stack-check can be done just on the kernel, does not need to be changed in make.conf? I'll re-read the stuff above just in case it's mentioned there and I missed it.

I have zero hardware which is unaffected. You'd think I would luck out once maybe, statistically speaking.

I have no Gentoo systems with a gui or which are used by an untrusted user logging into a shell. They're all servers and security appliances and KVM/QEMU which run server VMs.

I have many more boxes and VMs which are some sort of binary distro. So I'm a bit frazzled right now. Not your problem. I'm waiting on those to see what the distro does.

My test Gentoo box has QuickAssist. It's an atom c2758. I'm scared to find out what that means with respect to Spectre and Meltdown. Fortunately enough it's been overkill for everything I've configured it for, so a loss of performance is unlikely to matter much.

----------

## Hu

The kernel build does not respect make.conf, so changing it there will not help you.  The user packages that respect it do not need it changed, so changing it would be counterproductive.  If you want to hand-apply the change for the kernel build, I believe placing the value in $KBUILD_CFLAGS will suffice (but this is from old memory, so it might be wrong; check before relying on it).

For 2, if you don't let untrusted users run arbitrary unprivileged code, your risk is lower.  I can't say it's impossible for an untrusted user to leverage the existing programs, but if they have no shell access, no permission to upload programs to run, and no permission to upload scripts to run, they will likely have a very difficult time running code that can exploit these problems due to the need to run specific sequences with tight timing tolerances.  The VM hosts could be a problem, if you have untrusted users running in the guests (including, but not necessarily limited to, untrusted users who are authorized to be root on their respective guests).  If the VMs are intended for isolation/management/redundancy purposes, rather than security enforcement against untrusted users, then they are probably fine.

I can't usefully comment on the other points.

----------

## 1clue

Reading this from my phone, I think you've given me what I need. 

The vm guests are all servers, mostly non-gui unless it's something like oracle, which wants centos.

At any rate the only people who have interesting access to any of this, either host or guest, are financially driven by the need for these systems to work correctly and have worked with me for 5 years or better. That's still no guarantee but I'll take those odds.

Thanks for your time. I'll post here later with status one way or the other.

----------

## 1clue

I'm pretty sure I have a non-working kernel based on some other config option, but I'm not sure if I'm doing no-stack-check right.

```

export KBUILD_CFLAGS='-fno-stack-check'

make clean

mount /boot

make && make modules && make modules_install && make install

grub-mkconfig > /boot/grub/grub.cfg

```

----------

## Hu

That looks right, except that I was wrong about the variable to use, and you did not catch me in it.  Looking now at the kernel sources, I think the right variable is $KCFLAGS.  Variable KBUILD_CFLAGS is used internally, and may ignore your environment setting.

The simplest option is to use a kernel that sets -fno-stack-check automatically through its build system.

----------

## 1clue

 *Hu wrote:*   

> That looks right, except that I was wrong about the variable to use, and you did not catch me in it.  Looking now at the kernel sources, I think the right variable is $KCFLAGS.  Variable KBUILD_CFLAGS is used internally, and may ignore your environment setting.
> 
> The simplest option is to use a kernel that sets -fno-stack-check automatically through its build system.

 

I'm feeling kinda stupid right now. I've been running Linux since the 90s, compiling my own kernels since about 98, and never once passed a kernel build option. My Google kung-fu is broken, getting no results.

----------

## 1clue

Still no joy.

Using kernel linux-4.14.8-gentoo-r1, hardened 17.1 profile. I can't tell looking at the 'make' output whether it took the setting or not.

Going to dig deeper into the diff between the new config and the old config, and see if there's some other reason why I bricked my kernel.
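For digging through two configs, a plain-shell comparison can narrow things down to the options that are new or changed. The sketch below is one way to do it; config_changes is a made-up helper name, and the kernel tree also ships scripts/diffconfig, which may be more convenient.

```shell
# Sketch: print the option lines that appear in the second .config but not,
# verbatim, in the first (that is, new or changed options).
# config_changes is a hypothetical helper name.
config_changes() {
    old_sorted=$(mktemp) && new_sorted=$(mktemp) || return 1
    # Keep both "CONFIG_X=..." and "# CONFIG_X is not set" lines.
    grep -E '^(# )?CONFIG_' "$1" | LC_ALL=C sort > "$old_sorted"
    grep -E '^(# )?CONFIG_' "$2" | LC_ALL=C sort > "$new_sorted"
    LC_ALL=C comm -13 "$old_sorted" "$new_sorted"
    rm -f "$old_sorted" "$new_sorted"
}

# Usage (paths are examples):
#   config_changes /usr/src/linux-4.12.12-gentoo/.config /usr/src/linux-4.14.8-gentoo-r1/.config
```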

----------

## Hu

Your technique looked correct, aside from using the variable name I picked without adequate checking.  You should export KCFLAGS=-fno-stack-check, rather than export KBUILD_CFLAGS=....  You can check the make output by switching the kernel build to verbose mode.  If I recall correctly, that is make V=1 make-target.
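A sketch of how that check could look, assuming the verbose build output is saved to a file first (flag_in_log is a made-up helper name and build.log is just an example file name):

```shell
# Sketch: succeed if the given compiler flag appears anywhere in a build log
# read from stdin. flag_in_log is a hypothetical helper name.
flag_in_log() {
    grep -qF -e "$1"
}

# Usage from the kernel source directory (not run here):
#   export KCFLAGS=-fno-stack-check
#   make V=1 > build.log 2>&1
#   flag_in_log -fno-stack-check < build.log && echo "flag reached the compiler"
```

Saving the log first keeps the build from being cut short when grep exits early on the first match.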

----------

## 1clue

I've verified that the -fno-stack-check is being applied. The kernel still panics during the first second or so of boot. So it's a problem with my config, I did something stupid with the new options when I switched to 4.14.8.

----------

## mimosinnet

Have you been able to solve it? 

I am having a similar issue when moving to 4.14.8 kernel in a hardened box. This is the screenshot of the error. I can boot with SystemRescueCd and this is the kernel created with oldconfig and this is a new kernel config from scratch.

Cheers!

----------

## asturm

Don't use .8, use the latest available version 4.14.14.
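Since the thread puts the -fno-stack-check quirk at 4.14.11, a quick version comparison shows whether a 4.14.x kernel is new enough. This is a sketch only: version_ge is a made-up helper name, it relies on sort -V from GNU coreutils, and the 4.14.11 threshold is only meaningful within the 4.14 series.

```shell
# Sketch: succeed if version $1 is greater than or equal to version $2.
# version_ge is a hypothetical helper; it relies on GNU sort's -V option.
version_ge() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage: strip the -gentoo suffix from the running kernel's release first.
#   version_ge "$(uname -r | cut -d- -f1)" 4.14.11 && echo "has the quirk"
```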

----------

## 1clue

 *mimosinnet wrote:*   

> Have you been able to solve it? 
> 
> I am having a similar issue when moving to 4.14.8 kernel in a hardened box. This is the screenshot of the error. I can boot with SystemRescueCd and this is the kernel created with oldconfig and this is a new kernel config from scratch.
> 
> Cheers!

 

Still not solved. My next step is to start over with my 4.12.12 kernel config, do 'make oldconfig' and then be very careful about all the new options.

I haven't had time to mess with this for a while, so it's been running on 4.12.12.

----------

## mimosinnet

 *1clue wrote:*   

> Still not solved. My next step is to start over with my 4.12.12 kernel config, do 'make oldconfig' and then be very careful about all the new options.

 

I have done it the other way around. I ran 'make oldconfig' with the .config from linux-4.14.8-gentoo-r1 (that does not boot) in linux-4.12.12-gentoo, and it boots. These are the differences between both .config files. As suggested by asturm: 

 *asturm wrote:*   

> Don't use .8, use the latest available version 4.14.14.

 

There seems to be an issue with linux-4.14.8-gentoo-r1.

Cheers!

Edit: Oops! I have just noticed that kernels 4.12.12 and 4.14.8-r1 are masked!

----------

## 1clue

 *asturm wrote:*   

> Don't use .8, use the latest available version 4.14.14.

 

FWIW I finally got it.

I abandoned 4.14.8 and did 4.14.14, worked the first time. I did 'make oldconfig' and was sure to read all the documentation before choosing an option. Did that on 4.14.8 as well, but nothing I did made a running kernel.

Thanks.

----------

## asturm

Because .8 was broken on hardened. That's why it was masked.

----------

