# Kernel Hangs at Loading

## jyoung

Hi Folks,

After upgrading from a 4.12.12 kernel, the new kernel (4.13.15, 4.14.8-r1) hangs while grub is loading it. All I get is this message:

Loading Linux 4.13.15-gentoo ...

Grub is detecting it, since if I move or delete the kernel, grub complains that it can't find it. I've tried numerous kernel options (make oldconfig from the working kernel's config, genkernel, etc.), but the result is the same. This problem is particularly hard to debug since the result is a non-working system. Some web searching turns up folks with similar problems, but usually grub complains with some kind of error message (can't find the kernel, can't load the kernel, etc.). Here, it's just silent.

Any ideas?

----------

## Hu

First, check that the kernel is supposed to print output: no command line options silenced it, that it is writing to the correct output device, etc.
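To make that first check concrete, here is a minimal Python sketch that pulls the output-related parameters out of a kernel command line. `quiet`, `loglevel=` and `console=` are standard kernel parameters; the function name and the sample command line are only illustrative assumptions, so substitute the `linux` line from your own grub.cfg.

```python
def output_related_params(cmdline: str) -> dict:
    """Report the kernel parameters that affect whether and where boot messages appear."""
    tokens = cmdline.split()
    return {
        # "quiet" suppresses most boot messages on the console
        "quiet": "quiet" in tokens,
        # "loglevel=N" limits which message priorities reach the console
        "loglevel": [t.split("=", 1)[1] for t in tokens if t.startswith("loglevel=")],
        # "console=..." selects the device(s) the kernel writes its messages to
        "console": [t.split("=", 1)[1] for t in tokens if t.startswith("console=")],
    }

# Illustrative command line, not taken from the failing system.
sample = "root=/dev/sda1 ro quiet console=ttyS0,115200"
print(output_related_params(sample))
```

If `quiet` is set, or the only `console=` entry points at a serial port you are not watching, a silent screen does not by itself prove the kernel never started.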

Second, try to find the origin of the failure.  Set aside your known-good 4.12.12 kernel.  Clean your kernel build area and build a new 4.12.12 kernel from the same configuration, to rule out toolchain changes that may have broken something.  If the newly built 4.12.12 also works, then we can assume a kernel source code change is your problem.  If the old 4.12.12 works and the new one fails, we can assume there is a toolchain problem.

The rest of this post is written on the assumption it is a source change, not a toolchain change.  If the toolchain is implicated, stop here and post back with your findings.  Otherwise, read on.

Kernel 4.12.x went up to 4.12.14 before being retired.  You could test later 4.12.x kernels in the hope that one of them is broken.  If so, it will be comparatively easy to find the bad patch since only a few hundred commits are in question.  If all 4.12.x work, and no 4.13.x work, then finding the bad commit is much more tedious.  In either case, you will use git bisect to test intermediate kernels to find the first that fails to boot.  If 4.12.14 is bad, you can probably find it in ~log2(78) steps.  If 4.12.14 is good and 4.13 is bad, you may need ~log2(14150) steps.
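As a rough check on those step counts, here is a small Python sketch. It assumes bisection halves the candidate range at every step, so the worst case is about ceil(log2(N)) test builds. The commit counts are the ones quoted above; the function name and labels are only illustrative.

```python
import math

def bisect_steps(commits: int) -> int:
    """Worst-case number of kernels to build and boot when halving a range of `commits`."""
    return math.ceil(math.log2(commits))

# Commit counts quoted in the paragraph above.
for label, commits in [("within 4.12.x", 78), ("4.12 to 4.13", 14150)]:
    print(f"{label}: about {bisect_steps(commits)} test builds")
```

Under that assumption the first case is about 7 builds and the second about 14, so the larger range costs roughly twice as many boots, not two hundred times as many.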

----------

## NeddySeagoon

jyoung,

```
Loading Linux 4.13.15-gentoo ..
```

is the last message from grub.  Look at your grub.cfg.

The first output from the kernel is

```
Decompressing Linux... 
```

The kernel needs to have the right decompressor built in, or you don't get any messages and the kernel can't decompress itself.

The file /usr/src/linux/vmlinux is the uncompressed kernel.  

Booting that may work but I never tried an uncompressed kernel on amd64.

----------

## jyoung

The clean 4.12.12 doesn't have this issue, and neither does the 4.12.14 kernel. I think this rules out a toolchain problem, and points toward a kernel source change between 4.12 and 4.13. To confirm, I'll attempt to compile an earlier 4.13.

I'm also interested in the idea of booting off an uncompressed kernel.  When I simply copy vmlinux to /boot (along with the other kernels), grub-mkconfig doesn't seem to detect it. Should I modify grub.cfg for this experiment (despite the warning)?

----------

## NeddySeagoon

jyoung,

Read the bottom of grub.cfg.

The warning is because manual edits will be removed when grub.cfg is regenerated.

Don't break any of your working grub.cfg entries.

----------

## jyoung

I guess I'd never looked at the bottom of grub.cfg! I've set up a manual config file in custom.cfg; grub now detects the uncompressed kernel, but I'm getting

"error: invalid magic number"

Below are the contents of custom.cfg. I've mostly copied the setup from the menu entries in grub.cfg:

```
menuentry 'uncompressed'  {
        if [ x$feature_platform_search_hint = xy ]; then
          search --no-floppy --fs-uuid --set=root  39ec98c1-c234-4ee3-bb13-6d0d5f84b1be
        else
          search --no-floppy --fs-uuid --set=root 39ec98c1-c234-4ee3-bb13-6d0d5f84b1be
        fi
        echo "Loading linux..."
        linux   /boot/vmlinux root=/dev/nvme0n1p4 ro rootfstype=ext4
}
```

On the other front, the 4.13.5 kernel (which doesn't work) is the lowest 4.13 kernel available.

----------

## jyoung

I also get the magic number error if I try to boot off the compressed kernel using the custom menu option.
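One way to see what grub is objecting to is to look at the magic numbers in the file itself. Below is a minimal Python sketch, assuming an x86 kernel: the Linux x86 boot protocol puts the boot flag 0xAA55 at offset 0x1FE and the string "HdrS" at offset 0x202 of a bootable image, while an uncompressed vmlinux is an ELF file starting with "\x7fELF". The function names are hypothetical. This only tells you what kind of file a path holds; it does not explain why the compressed kernel also fails from the custom entry.

```python
import struct

ELF_MAGIC = b"\x7fELF"

def classify_kernel_image(data: bytes) -> str:
    """Classify the leading bytes of a kernel file by its magic numbers."""
    if data[:4] == ELF_MAGIC:
        return "elf"          # e.g. the uncompressed vmlinux
    if len(data) >= 0x206:
        # boot_flag is a little-endian 16-bit value at offset 0x1FE
        (boot_flag,) = struct.unpack_from("<H", data, 0x1FE)
        if boot_flag == 0xAA55 and data[0x202:0x206] == b"HdrS":
            return "bzimage"  # x86 boot-protocol image, e.g. arch/x86/boot/bzImage
    return "unknown"

def classify_kernel_file(path: str) -> str:
    """Read just enough of the file at `path` to classify it."""
    with open(path, "rb") as f:
        return classify_kernel_image(f.read(0x206))
```

For example, `classify_kernel_file("/boot/vmlinux")` should return `"elf"` for the file copied above, and a kernel that boots from the regular menu should come back as `"bzimage"`. An `"unknown"` result for a file you expected to be bootable would be worth investigating.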

----------

## Jaglover

Did you run make clean before you ran make? Maybe you should.

----------

## ipic

```
Loading Linux 4.13.15-gentoo ..    then nothing
```

I had this a while back, and the cause was that I hadn't spotted that I had filled up my boot partition. The copy of the kernel images to the boot partition failed silently, truncating the image. Grub found it OK (since the file existed), but it just went nowhere on load.

Probably not your problem but I thought I'd mention it just in case...
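If you want to rule this out quickly, here is a minimal Python sketch that compares the freshly built image with the copy in /boot and reports the free space on the target filesystem. The function names are made up for illustration, and the paths in the usage note are assumptions about a typical layout.

```python
import hashlib
import os
import shutil

def sha256_of(path: str) -> str:
    """Hash a file in 1 MiB chunks so large images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copy_is_intact(built: str, installed: str) -> bool:
    """True if the installed image has the same size and checksum as the built one."""
    if os.path.getsize(built) != os.path.getsize(installed):
        return False  # a truncated copy shows up here first
    return sha256_of(built) == sha256_of(installed)

def free_mib(path: str) -> float:
    """Free space, in MiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / (1024 * 1024)
```

A call such as `copy_is_intact("/usr/src/linux/arch/x86/boot/bzImage", "/boot/vmlinuz-4.13.15-gentoo")` returning `False`, or `free_mib("/boot")` being close to zero, would point at the silent truncation described above.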

Regards

Ian

----------

## jyoung

Just to be sure, I ran make clean in 4.13.5 and rebuilt it. The results were unchanged.

My system is set up with /boot/grub/efi on a separate partition, but /boot/grub on the root partition. So, I'm definitely not running out of space for the new kernels.

----------

## xpxp2002

I'm also experiencing this issue. Upgrading from 4.12.12 to 4.14.8-r1. Used oldconfig to bring .config current. Booting off of 4.14.8-r1 freezes at the loading line out of Grub. 4.12.12 works just fine.

I think this new 4.14 branch still has some issues.

----------

## xpxp2002

Been working on this for hours since Monday. Just figured it out...for my system, at least.

Try setting CONFIG_PGTABLE_LEVELS=4 if it is set to 5.

----------

## jyoung

xpxp2002, CONFIG_PGTABLE_LEVELS=4 in my .config.

It looks like it's set by arch, without a prompt in menuconfig. I tried setting it to 5 manually (just to see what would happen), but 'make' immediately rewrote .config with 4.

----------

## Jaglover

This may be at least partially a gcc-6 problem. Have you tried with gcc-7?

----------

## NeddySeagoon

jyoung,

A word of advice on editing the kernel .config by hand.  Don't.

It's very easy to end up with an illegal .config that produces a horribly broken kernel.

Then it's difficult to diagnose because nobody has seen anything like it before.

The problem stems from a single menuconfig entry flipping lots of .config file flags.

----------

## jyoung

Yeah, I would never try this on a kernel that I actually needed ... I'm still running off 4.12.12 while I work on 4.14.8-r1.

I'm using gcc-7.2.0; in fact, that's the only one I'm seeing in gcc-config -l.

----------

## xpxp2002

 *jyoung wrote:*   

> xpxp2002, CONFIG_PGTABLE_LEVELS=4 in my .config.
> 
> It looks like it's set by arch, without a prompt in menuconfig. I tried setting it to 5 manually (just to see what would happen), but 'make' immediately rewrote .config with 4.

 

Hmm. Sorry that didn’t work for you. I’ve been trying to get mine to work for four days.

I’m on amd64. What arch is this?

----------

## NeddySeagoon

jyoung,

It looks like 4.14 is more trouble than it's worth.

----------

## jyoung

I'm also on amd64.

This problem also occurred with 4.13.5, but I see that that's already been removed.

Shall we mark this thread as solved? It's not really solved ... I'm certainly willing to try again and report back once a 4.15 kernel is released.

----------

## Hu

Why wait?  4.15 is already up to -rc5.  Linus usually releases around -rc7 or -rc8, depending on how he feels about overall quality.  You might not want to stay on a -rcN kernel for daily work, but for a quick test, it's probably safe.

----------

## NeddySeagoon

Hu,

I've been using it since 4.15.0-rc1 as it has a new amdgpu driver.

I'm on rc-4 now and I can say that it seems to work for me.

----------

## jyoung

I've just tried 4.15_rc5, and the result is the same. The boot sequence is stuck at the "Loading Linux" bit.

You folks are getting your kernels from git-sources, yes?

----------

## NeddySeagoon

jyoung,

I fetched the kernel from kernel.org but git-sources is the Gentoo way to do the same thing.

Let's start from the very beginning.  Post your

```
lspci -nnk
```

output.

Pastebin your non working 4.15.0-rc5 .config. Then I can drop it into the kernel and look at it.

Explain your filesystems in use. Particularly root, where it is, what it is and any hoops you need to jump through to mount it.

----------

## jyoung

Okay, here's a link to my .config

https://pastebin.com/4d4T1pms

This is a fresh .config, generated by running make menuconfig and then exiting without making any changes. I've also attempted using make oldconfig off the 4.12.12 kernel.

My drive is NVME, so it uses EFI. Also, instead of the partitions being named /dev/sda#, they're /dev/nvme0n1p#:

/dev/nvme0n1p1  boot bios partition

/dev/nvme0n1p2  boot partition, mounted at /boot/grub/efi

/dev/nvme0n1p3  swap

/dev/nvme0n1p4  root partition

/dev/nvme0n1p5  home partition

Curiously, running df tells me that the boot and home partitions are mounted as expected, but the root path (/) corresponds to /dev/root instead of /dev/nvme0n1p4.

One significant issue is that I had to enable the NVME items in the kernel in order to properly load the drive. I doubt that is the problem here for two reasons: 1) I've also tried enabling them in 4.14, and the problem remained, and make oldconfig from 4.12.12 would have enabled them too, and 2) before I enabled the proper NVME drivers in 4.12.12, the kernel loaded itself and started the boot process, failing at a later step, while here it's not even loading. All that said, I'd be happy to mess around with the NVME stuff some more if you folks think it's likely.

Running 

```
lspci -nnk
```

 produces

```
00:00.0 Host bridge [0600]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers [8086:1904] (rev 08)
   Subsystem: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers [8086:2015]
   Kernel driver in use: skl_uncore
00:02.0 VGA compatible controller [0300]: Intel Corporation HD Graphics 520 [8086:1916] (rev 07)
   Subsystem: Microsoft Corporation HD Graphics 520 [1414:0015]
   Kernel driver in use: i915
   Kernel modules: i915
00:05.0 Multimedia controller [0480]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Imaging Unit [8086:1919] (rev 01)
   Subsystem: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Imaging Unit [8086:2015]
00:08.0 System peripheral [0880]: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th Gen Core Processor Gaussian Mixture Model [8086:1911]
   Subsystem: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th Gen Core Processor Gaussian Mixture Model [8086:2015]
00:14.0 USB controller [0c03]: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller [8086:9d2f] (rev 21)
   Subsystem: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller [8086:7270]
   Kernel driver in use: xhci_hcd
   Kernel modules: xhci_pci
00:14.2 Signal processing controller [1180]: Intel Corporation Sunrise Point-LP Thermal subsystem [8086:9d31] (rev 21)
   Subsystem: Intel Corporation Sunrise Point-LP Thermal subsystem [8086:7270]
   Kernel driver in use: intel_pch_thermal
   Kernel modules: intel_pch_thermal
00:14.3 Multimedia controller [0480]: Intel Corporation Device [8086:9d32] (rev 01)
   Subsystem: Intel Corporation Device [8086:7270]
00:15.0 Signal processing controller [1180]: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0 [8086:9d60] (rev 21)
   Subsystem: Intel Corporation Sunrise Point-LP Serial IO I2C Controller [8086:7270]
   Kernel driver in use: intel-lpss
   Kernel modules: intel_lpss_pci
00:15.1 Signal processing controller [1180]: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1 [8086:9d61] (rev 21)
   Subsystem: Intel Corporation Sunrise Point-LP Serial IO I2C Controller [8086:7270]
   Kernel driver in use: intel-lpss
   Kernel modules: intel_lpss_pci
00:15.2 Signal processing controller [1180]: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #2 [8086:9d62] (rev 21)
   Subsystem: Intel Corporation Sunrise Point-LP Serial IO I2C Controller [8086:7270]
   Kernel driver in use: intel-lpss
   Kernel modules: intel_lpss_pci
00:15.3 Signal processing controller [1180]: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #3 [8086:9d63] (rev 21)
   Subsystem: Intel Corporation Sunrise Point-LP Serial IO I2C Controller [8086:7270]
   Kernel driver in use: intel-lpss
   Kernel modules: intel_lpss_pci
00:16.0 Communication controller [0780]: Intel Corporation Sunrise Point-LP CSME HECI #1 [8086:9d3a] (rev 21)
   Subsystem: Intel Corporation Sunrise Point-LP CSME HECI [8086:7270]
   Kernel driver in use: mei_me
   Kernel modules: mei_me
00:16.4 Communication controller [0780]: Intel Corporation Device [8086:9d3e] (rev 21)
   Kernel driver in use: mei_me
   Kernel modules: mei_me
00:1c.0 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 [8086:9d14] (rev f1)
   Kernel driver in use: pcieport
   Kernel modules: shpchp
00:1d.0 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI Express Root Port #9 [8086:9d18] (rev f1)
   Kernel driver in use: pcieport
   Kernel modules: shpchp
00:1f.0 ISA bridge [0601]: Intel Corporation Sunrise Point-LP LPC Controller [8086:9d48] (rev 21)
   Subsystem: Intel Corporation Sunrise Point-LP LPC Controller [8086:7270]
00:1f.2 Memory controller [0580]: Intel Corporation Sunrise Point-LP PMC [8086:9d21] (rev 21)
   Subsystem: Intel Corporation Sunrise Point-LP PMC [8086:7270]
00:1f.3 Audio device [0403]: Intel Corporation Sunrise Point-LP HD Audio [8086:9d70] (rev 21)
   Subsystem: Intel Corporation Sunrise Point-LP HD Audio [8086:7270]
   Kernel driver in use: snd_hda_intel
   Kernel modules: snd_hda_intel, snd_soc_skl
01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM951/PM951 [144d:a802] (rev 01)
   Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller SM951/PM951 [144d:a801]
   Kernel driver in use: nvme
02:00.0 Ethernet controller [0200]: Marvell Technology Group Ltd. 88W8897 [AVASTAR] 802.11ac Wireless [11ab:2b38]
   Subsystem: Device [0003:045e]
   Kernel driver in use: mwifiex_pcie
   Kernel modules: mwifiex_pcie
```

----------

## NeddySeagoon

jyoung,

Here's how booting works. It solves the problem of loading an operating system from the block device without being able to read the filesystem on the block device.

This is for BIOS, not UEFI, but the problems are the same.  UEFI can read exactly one filesystem, vfat, so everything needed to get started has to be there.

The BIOS can read exactly one disk block. That's LBA 0. When it starts, it does all the POST checks, sets up the hardware, loads LBA 0 into RAM and jumps to its start address.

LBA 0 contains at most 446 bytes of code. All it can do is make BIOS calls to load some more disk blocks into RAM ... and jump to the start address.

So we have a chain of loaders, each more capable than the last. Eventually grub gets loaded, reads the filesystem it's on, and shows you a menu.

When you make your choice, Grub shows the message about loading the kernel and, if you have an initrd, about loading the initrd.

You don't report the initrd message, so let's assume you don't have one.

Grub exits by jumping to the kernel start address.  The kernel binary is all alone at this time. It can't load any modules until root is mounted, as they are in /lib/modules.

The kernel decompresses itself and, as it starts, it puts a message on the display that you don't see.  It's a good idea to have all the modules listed by lspci configured as <*> in the kernel.

```
CONFIG_DRM_I915=m
```

is for your Intel framebuffer driver. It will start after root is mounted but is otherwise OK.

Under

```
 # Frame buffer hardware drivers
```

only 

```
CONFIG_FB_EFI=y
# CONFIG_FB_SIMPLE is not set
```

may be enabled.  The others all fight over the hardware and no display driver works.

```
CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y
```

Your display is in portrait mode?

```
xhci_hcd. CONFIG_USB_XHCI_HCD=m
```

 No USB 3 until root is mounted.  It may not matter but if you had an initrd, this may prevent you interacting with the rescue shell.

Ahhh. 

```
Sunrise Point-LP
```

That's in the middle of everything on your system.  Everything Sunrise Point related must be built into the kernel.

```
CONFIG_MFD_INTEL_LPSS=m
CONFIG_MFD_INTEL_LPSS_PCI=m
CONFIG_INTEL_MEI_ME=m
```

All need to be built in.

Change 

```
CONFIG_HOTPLUG_PCI_SHPC=m
```

to built in too.

I don't expect it to boot after those changes but we might get some debug information.

As a rule of thumb, everything needed to get the root filesystem mounted needs to be built in. Other things can be left as modules.

The drivers for your Sunrise Point chipset come under the heading needed to get the root filesystem mounted.

-- edit --

Some of those settings will only be available as off or <M>.  You will need to go back up the menu system and change the menu(s) from <M> to <*> to be able to select built in.
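To check which of the options named above are still modules after a trip through menuconfig, a read-only scan of .config is enough. This is a minimal Python sketch: the option names are the ones listed in this post, the function names and the sample text are illustrative, and it deliberately does not edit .config.

```python
def parse_kernel_config(text: str) -> dict:
    """Map option names to values, skipping blanks and '# ... is not set' comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            options[name] = value
    return options

# Options this post asks to have built in (<*>) rather than modular (<M>).
WATCHLIST = [
    "CONFIG_MFD_INTEL_LPSS",
    "CONFIG_MFD_INTEL_LPSS_PCI",
    "CONFIG_INTEL_MEI_ME",
    "CONFIG_HOTPLUG_PCI_SHPC",
]

def still_modular(options: dict, watchlist=WATCHLIST) -> list:
    """Return the watched options that are still set to 'm'."""
    return [name for name in watchlist if options.get(name) == "m"]

# Sample using the values quoted in this post.
sample = """
CONFIG_MFD_INTEL_LPSS=m
CONFIG_MFD_INTEL_LPSS_PCI=m
CONFIG_INTEL_MEI_ME=m
CONFIG_HOTPLUG_PCI_SHPC=m
# CONFIG_FB_SIMPLE is not set
"""
print(still_modular(parse_kernel_config(sample)))
```

To run it against a real tree, read the file first, for example `still_modular(parse_kernel_config(open("/usr/src/linux/.config").read()))`. An empty list means all four are either built in or not set at all.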

----------

## toralf

 *jyoung wrote:*   

> I've tried numerous kernel options (make oldconfig from the working kernel's config, genkernel, etc.), but the result is the same. 

I've had really good experiences over the past few years, especially on my headless server, with running "make distclean; make defconfig; make menuconfig" (the latter to compile in the file systems you need for / and /boot and to choose the right network card). If that kernel boots, you can then strip the .config down to the desired one. And for a quick test of your kernel .config with respect to modularization, you could deactivate that option, and every "m" will then become a "y".

----------

## jyoung

I've made the following alterations:

```
CONFIG_HOTPLUG_PCI_SHPC=y
CONFIG_MFD_INTEL_LPSS=y
CONFIG_MFD_INTEL_LPSS_ACPI=y
CONFIG_MFD_INTEL_LPSS_PCI=y
```

No change, though. Do you think that I should deactivate all the options under Frame Buffer devices except CONFIG_FB_EFI and related items?

As for the other question: yes, this is a tablet, so I actually do use portrait mode quite a lot.

----------

## NeddySeagoon

jyoung,

Tablet, as in System on a Chip (SoC) tablet?

Please tell more about the hardware, the make and model.  

Simple framebuffer and EFI framebuffer are harmless.

So yes, the others should be off but as they are loadable modules, they can't be loaded until root is mounted, unless you have an initrd, and root is not being mounted as far as we can tell.

The inference is that that's not your problem yet.

The very early kernel messages can't use the framebuffer as it's still compressed (or being decompressed) as the kernel starts.

SoC hardware often uses the I2C bus for all sorts of setup too.

----------

## jyoung

Hi,

It's a Surface Pro 4 with a Core i5 processor. My understanding is that to be an SoC tablet, other devices (like the wifi adapter and the video card) would have to be on the same chip, which I don't think is the case. I think they appear as distinct devices with lspci, although I suppose the two might not be mutually exclusive.

----------

## NeddySeagoon

jyoung,

The Wifi is likely to be a physically separate unit because of regulatory issues.

It makes it easy to change from one regulatory domain to another.

The video will be on the same piece of silicon as the CPU. It will still show up in lspci as being on a PCI bus because it is :)

Everything else that does not need to be localised for assorted regulations will be on the SoC silicon too.

Your lspci won't have changed. Please pastebin your current kernel .config again.

----------

## krinn

You keep saying grub, but the config entry you show is for grub:2, not grub:0.

I'm not using it myself, but I've seen plenty of times that grub:2's strength comes from loading the proper modules to handle an option...

Maybe you just didn't set up grub:2 to use any modules, and loading a kernel from ext4 may need proper ext4 support in grub:2 before grub:2 is able to read anything from an ext4 partition...

From https://wiki.gentoo.org/wiki/GRUB2#Chainloading you can see that even simple chainloading needs grub:2 to have the proper modules set ("insmod part_msdos", "insmod chainloader"). That also shows that even the common msdos partition type is unknown to grub:2 if you don't use the part_msdos module.

And I think you might be booting your old kernels from a properly set up grub:2 entry, while you keep trying to kick off your "test" kernels from a possibly "not valid" grub entry.

That would leave grub:2 unable to use the kernel you are telling it to kick off.

I would expect grub:2 to throw you an error, but as I say, I'm not using grub:2 and don't know how it reacts.

So if we only take your error: it comes after "loading kernel", but before you see "uncompressing ..."

It's logical that if grub has something to say, it must say it before grub stops running, which means saying "Loading kernel" before actually loading the kernel. By the same logic, the kernel must be running before it is able to say anything.

You keep assuming the kernel is in trouble because the kernel is loaded, while the kernel may not be loaded at all; your assumption is based only on the fact that you see grub saying "loading the kernel".

----------

## jyoung

Hi Folks,

Sorry I've been so quiet. This particular computer is used for work, and I haven't been able to mess around with the kernel and such...

For the time being, I'm going to have to let this project rest.

----------

