# Debugging kernel boot?

## uberDoward

I don't think I'm getting as far as Open-RC - I get the linux penguins along the top, with [timestamp] message displayed.

Anyway, the kernel appears to be getting hung, without actually going into a kernel panic.  Is there a way to debug the actual kernel boot process?  I find a lot about debugging Open-RC, little about the kernel's boot...

----------

## eccerr0r

Knowing what were the last things that it printed before it hangs would be useful.  A picture would help if it's difficult to type what it wrote.

Yes, it would be helpful to know whether it started OpenRC or didn't even get that far. Did it free unused kernel memory (usually one of the last things it does before handing off control to init)?

----------

## NeddySeagoon

uberDoward,

It's possible to configure the kernel with no console.  Everything still works but there is nothing on the screen.

From memory, you don't even get the tux icons.

If openrc is being started, you may be able to log in via ssh, if that's set up.

Then there is a serial console, if your system has a real serial port and you have a way to connect to it.

There is also console over network.
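
As a rough sketch of the serial and network console options mentioned above, the kernel command line could carry parameters like the ones below. The port `ttyS0`, the baud rate, the interface `eth0`, the UDP ports and all IP addresses are placeholders, not values from this thread. The netconsole form follows the kernel's documented `netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]` syntax.

```shell
# Hypothetical fragment in the style of /etc/default/grub (placeholders throughout).

# console=tty0 keeps messages on the screen; console=ttyS0,115200n8 also sends
# them to the first serial port at 115200 baud, no parity, 8 data bits.
SERIAL_ARGS="console=tty0 console=ttyS0,115200n8"

# netconsole sends kernel messages as UDP packets from eth0 (192.168.1.50:6665)
# to a listener on 192.168.1.10:6666. It needs CONFIG_NETCONSOLE, and in a
# monolithic kernel the network driver has to be built in as well.
NETCON_ARGS="netconsole=6665@192.168.1.50/eth0,6666@192.168.1.10/"

GRUB_CMDLINE_LINUX="${SERIAL_ARGS} ${NETCON_ARGS}"
echo "${GRUB_CMDLINE_LINUX}"
```

On the receiving machine, any UDP listener on the chosen port should be enough to capture the messages.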

----------

## bunder

Is it possible to get early printk, then have it switch to a non-functional framebuffer?  If that's possible, that could in theory give the appearance of a hung boot with penguins.

With no framebuffer, you'd just get a black screen and no penguins.

----------

## NeddySeagoon

bunder,

That's possible with the text console built in, then switching to a broken framebuffer built as a module.

However - no penguins. 

It's also possible to have several framebuffers configured, and the preferred one could be broken.

Being lazy, I have vesafb and amdgpudrmfb.

They both work and the switch is clearly visible. Both during the boot process and in dmesg.

```
[    1.527743] vesafb: mode is 1024x768x16, linelength=2048, pages=29
[    1.527745] vesafb: scrolling: redraw
[    1.527748] vesafb: Truecolor: size=0:5:6:5, shift=0:11:5:0
[    1.527760] vesafb: framebuffer at 0xd0000000, mapped to 0xffffc90000400000, using 3072k, total 49152k
[    1.530011] fb0: VESA VGA frame buffer device
[    1.553452] fb: switching to amdgpudrmfb from VESA VGA
```

----------

## uberDoward

Using ASPEED, no modules (going for a very lean monolithic kernel).  Framebuffer appears to switch over - now you've got me wondering if I forgot to put in the console LOL, let me check that.

----------

## uberDoward

I'll get a picture shortly.  TTY is compiled into the kernel, let me check the framebuffers enabled.... are there specific options I need to have enabled?

----------

## eccerr0r

I'm confused, you did say the penguins showed up as well as the [0.2512512] blahblah timestamped kernel messages?

I wasn't sure if you were asking a general question or trying to fix your specific box...  If you did see the penguins and kernel messages, at least there's some clues to figure out what's going on.

On the other hand, black screen boots with no penguins are most annoying to debug.

----------

## NeddySeagoon

uberDoward,

Make friends with wgetpaste.

Put your lspci output in a post together with a link to your grub.cfg, or whatever your boot loader config file is and a link to your kernel .config file.

----------

## uberDoward

NeddySeagoon, that's freaking awesome!

config: https://paste.pound-python.org/show/7eCSAtiIBY3XYcLrhLUY/

grub.cfg: https://paste.pound-python.org/show/P69GrwzUOs0R54dWhAOY/

Picture of boot: https://imagebin.ca/v/3ht5U779oAg3

----------

## NeddySeagoon

uberDoward,

What we know so far ...

Grub loads the kernel and the kernel starts with a framebuffer console but we don't know which one.

UVESA is no longer complete, so it's not that one.

The help text on 

```
CONFIG_DRM_AST

Say yes for experimental AST GPU driver. Do not enable this driver without having a working -modesetting, and a version of AST that knows to fail if KMS is bound to the driver. These GPUs are commonly found in server chipsets.
```

Some/all DRM kernel drivers provide a free framebuffer, so at a guess, the kernel starts with the VESA framebuffer then switches to a broken AST framebuffer. Hence the warning about having a working -modesetting.

As a test, add nomodeset to your kernel command line(s) in grub.cfg.  $EDITOR will be fine meanwhile.

You won't like the result but we should get more information. 

Your lspci output will still be useful.
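
For the `nomodeset` test, a minimal sketch of the edit, assuming GNU sed and a grub.cfg whose kernel lines start with the plain `linux` command (not `linux16` or `linuxefi`). The function name `add_nomodeset` is made up for illustration. It only prints the modified text, so nothing in /boot is touched until you redirect the output yourself.

```shell
# Append "nomodeset" to every "linux" line of a grub.cfg-style file,
# skipping lines that already carry it. The result goes to stdout and
# the input file is left unchanged.
add_nomodeset() {
    sed -E '/^[[:space:]]*linux[[:space:]]/{/(^|[[:space:]])nomodeset([[:space:]]|$)/!s/[[:space:]]*$/ nomodeset/;}' "$1"
}

# Possible usage (review the diff before copying anything back):
#   add_nomodeset /boot/grub/grub.cfg > /tmp/grub.cfg.new
#   diff -u /boot/grub/grub.cfg /tmp/grub.cfg.new
```

For a one-off test it is just as good to press 'e' at the grub menu and type `nomodeset` at the end of the `linux` line by hand.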

----------

## eccerr0r

Since the minimal install CD booted (? what boot media did you use?), using that kernel's .config as a starting point would be helpful.  Agreed, your lspci info would be useful.

Other key things that are "interesting": It took almost 20 seconds for things to settle down.  A lot of these messages may have shown up asynchronously.

Some things I'd try to help debug (but not "fix"): Removing USB drivers (so we don't see all the USB async stuff, at the expense of no keyboard/mouse, but at least fewer things will scroll off).  Does Shift-PageUp scrollback show anything interesting (which won't work if your USB/keyboard isn't compiled in, but, hey...)?

Does /dev/sde4 sound like the proper root disk?

(Incidentally, I hate the penguins. I never compile that in because it hides 3-4 lines of screen real estate for kernel boot debug :) )

----------

## NeddySeagoon

eccerr0r,

There is also framebuffer rotate to get more lines on the screen.

The kernel normally mounts root before USB is initialised, so I think that root is mounted but it might still be read only.

-- edit --

We can try Interactive mode for openrc too.

Edit  /etc/rc.conf 

Find the line that says 

```
#rc_interactive="YES" 
```

and remove the # at the start.  Save the change.

Reboot normally.  As soon as you see the Penguins, press and hold the 'I' key.

Openrc will stop and ask about each service that it wants to start.

This happens before your keymap has been set, so it's the 'I' key on the USA QWERTY keyboard layout.

Depending on your keymap, that might matter.  I use dvorak-uk. It matters to me.

This will tell if openrc gets started or not.
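
The rc.conf edit above can also be done non-interactively. This is only a sketch, assuming GNU sed (for `-i`) and a commented-out `rc_interactive` line like the one shown; the function name `enable_rc_interactive` is made up for illustration.

```shell
# Remove the leading '#' from the rc_interactive line of the given rc.conf.
# Lines that are already uncommented, and all other lines, are left alone.
enable_rc_interactive() {
    sed -i -E 's/^#[[:space:]]*(rc_interactive=)/\1/' "$1"
}

# Possible usage, as root:
#   cp /etc/rc.conf /etc/rc.conf.bak
#   enable_rc_interactive /etc/rc.conf
#   grep rc_interactive /etc/rc.conf
```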

----------

## uberDoward

lspci -nnk output:

https://paste.pound-python.org/show/OzuhINTSLRkDa5SYIt4z/

Let me recompile without the penguins, lol - I've tried interactive mode, but to no avail.

/dev/sde4 is rootfs, /dev/sde2 is boot.  /dev/sde is the OS drive (32GB ssd)

I'll pull out the USB stuff, see if anything else helpful comes up :)

----------

## NeddySeagoon

uberDoward,

```
01:09.0 VGA compatible controller [0300]: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] (rev 10)
   Subsystem: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000]
```

That's interesting for what it doesn't say.  Look, no kernel module.

Try turning off 

```
CONFIG_DRM_AST
```

Whatever was driving the console when you posted lspci, it wasn't that.

What does dmesg have to say about the console driver?

```
$ dmesg | grep -B2 Console
[    0.000000]    Tasks RCU enabled.
[    0.000000] NR_IRQS: 4352, nr_irqs: 472, preallocated irqs: 16
[    0.000000] Console: colour dummy device 80x25
--
[    1.560247] vesafb: Truecolor: size=0:5:6:5, shift=0:11:5:0
[    1.560259] vesafb: framebuffer at 0xd0000000, mapped to 0xffffc90000400000, using 3072k, total 49152k
[    1.561408] Console: switching to colour frame buffer device 128x48
--
[    1.585905] checking generic (d0000000 3000000) vs hw (d0000000 10000000)
[    1.585905] fb: switching to amdgpudrmfb from VESA VGA
[    1.585929] Console: switching to colour dummy device 80x25
--
[    2.611092] [drm]    pitch is 10240
[    2.611125] fbcon: amdgpudrmfb (fb0) is primary device
[    2.828946] Console: switching to colour frame buffer device 320x90
```

----------

## uberDoward

Console grep from LiveCD: https://paste.pound-python.org/show/wiZ4B8lXGzUipTJnRiud/

Very interesting - let me remove the AST module, and see if there's a vesafb I don't have set up in the kernel...

*edit*

Latest kernel .config : https://paste.pound-python.org/show/Vb7OLaGwv2LfxgsAxbtc/

----------

## uberDoward

Ok, weird.  No more high res VESA, but got a kernel panic (VFS: unable to mount root fs on unknown block(8,68))

Looking into it, noticed that hitting grub's command line, everything was pointing @ hd4,gpt2.  ls hd0,gpt2, however was the correct one.  Now, why would the BIOS decide to change hd0?  No idea, I didn't change anything in there (I have first HDD boot as my /dev/sde 32GB SSD).

So manually editing grub via 'e' to point to (hd0,gpt4) and /dev/sda4 for the kernel command line yields: https://imagebin.ca/v/3i0X9x50yDLx
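
The `(major,minor)` pair in a VFS panic of this kind can be decoded by hand. For sd-style disks (major 8) each disk owns 16 minor numbers, so a pair such as (8,68) would correspond to the fifth disk, partition 4, i.e. /dev/sde4. A small sketch of that arithmetic follows; the function name `decode_sd` is made up for illustration, and it only covers major 8 (the first 16 sd disks).

```shell
# Decode a major-8 device number into a /dev/sdXN name.
# minor / 16 selects the disk (0 = sda, 1 = sdb, ...),
# minor % 16 selects the partition (0 = the whole disk).
decode_sd() {
    major=$1
    minor=$2
    [ "$major" -eq 8 ] || { echo "unsupported major $major" >&2; return 1; }
    disk=$(( minor / 16 ))
    part=$(( minor % 16 ))
    letter=$(printf 'abcdefghijklmnop' | cut -c $(( disk + 1 )))
    if [ "$part" -eq 0 ]; then
        echo "/dev/sd${letter}"
    else
        echo "/dev/sd${letter}${part}"
    fi
}

decode_sd 8 68   # prints /dev/sde4
```

This only says which device number the kernel was looking for, not whether a disk by that name existed at that point in the boot.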

----------

## eccerr0r

One thing that gets tripped up frequently is that the (hdX,Y) in grub has no relationship to /dev/sdX in Linux.  Ideally they map directly but no.  Also the (hdX,Y) do nothing for the kernel, it's only for grub to locate boot images (kernel, initramfs).

Anyway, interesting, so if you do have root=/dev/sde4 it behaves differently than if you have root=/dev/sda4?  That would imply the kernel did end up switching over to init?

----------

## NeddySeagoon

uberDoward,

Grub sees block devices as enumerated by the motherboard firmware.

The kernel sees them in PCI bus order as it scans the PCI bus but it's not that simple either.

Built-in drivers are always enumerated before modules.

It gets worse. The kernel may use several threads for PCI enumeration, so there is a possibility of a race.

The race can change the HDD order from kernel build to kernel build or if you are really unlucky, from boot to boot.

The point is that there is no deliberate correlation between what grub sees and the kernel sees.

Use root=PARTUUID=<your_root_PARTUUID> in place of root=/dev/sd..

blkid will show all the PARTUUIDs.  Google for the exact syntax.

It won't matter what device root is on, the kernel will find it.

The same thing will upset /etc/fstab.  You can use UUID or PARTUUID there, so that dynamic device renaming is harmless.
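
A minimal sketch of the PARTUUID lookup, assuming blkid output in the usual `DEVICE: KEY="value" ...` form. The helper name `root_arg_from_blkid` is made up for illustration; it reads blkid output on stdin and prints a matching `root=` argument for one device.

```shell
# Read `blkid` output on stdin, pick the line for the device given as $1
# and print the kernel argument root=PARTUUID=<that partition's PARTUUID>.
root_arg_from_blkid() {
    dev=$1
    sed -n "s|^${dev}:.* PARTUUID=\"\([^\"]*\)\".*|root=PARTUUID=\1|p"
}

# Possible usage, as root (the device name is whatever root is called
# on the currently running system):
#   blkid | root_arg_from_blkid /dev/sde4
#
# blkid can also print the bare value directly:
#   blkid -s PARTUUID -o value /dev/sde4
#
# The matching /etc/fstab entry would use the same tag, e.g.
#   PARTUUID=<your_root_PARTUUID>   /   ext4   noatime   0 1
```

The filesystem type and mount options in the fstab comment are placeholders; keep whatever your existing root entry already uses.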

----------

## uberDoward

So how do I make the UUID stick to a grub-mkconfig?

Note also, no initramfs here on startup.  I've just tried modifying /boot/grub/grub.cfg by hand, though, so let's see what happens, lol

----------

## uberDoward

So I got it to boot by setting /dev/sdg4 after editing the grub menu ('e' @ grub entry).

Something somewhere is happily screwing up my boot device names.  I couldn't even boot with root=PARTUUID=<partuuid> so I'm really at a loss.

----------

## eccerr0r

When you ran grub-mkconfig, it generated the /dev/sdXX and not PARTUUIDs?  Did you change /etc/default/grub?

I suppose if you have a lot of hard drives and they may change around randomly, it might be worth it to make sure PARTUUID works, or use initramfs that supports UUID.
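
On the /etc/default/grub side, a hedged sketch: newer GRUB releases (2.04 and later, as far as I know) have a `GRUB_DISABLE_LINUX_PARTUUID` setting that the GRUB manual describes as defaulting to `true`. Whether the GRUB version in this thread has it is an assumption, so check `info grub` on your own system first.

```shell
# Hypothetical /etc/default/grub fragment, assuming GRUB 2.04 or newer.
# With this set to "false", grub-mkconfig may write root=PARTUUID=...
# when it cannot use the filesystem UUID, which is the case without an
# initramfs. On older GRUB versions the variable is simply ignored.
GRUB_DISABLE_LINUX_PARTUUID=false

# Afterwards regenerate the config and check what was written:
#   grub-mkconfig -o /boot/grub/grub.cfg
#   grep -n 'root=' /boot/grub/grub.cfg
```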

----------

## NeddySeagoon

uberDoward,

The kernel understands PARTUUID - it's a property of the partition.

To use UUID, which is a property of a filesystem, you must use an initrd that contains the userspace mount binary. 

```
/sbin/blkid
/dev/sda1: UUID="9392926d-6408-6e7a-8663-82834138a597" TYPE="linux_raid_member" PARTUUID="0553caf4-01"
/dev/sde1: UUID="c400b18c-0210-4338-a0fd-f437ecbaaf99" TYPE="ext4" PARTLABEL="ext4" PARTUUID="150e6ef1-7ba8-409c-9c3f-dbdecdc9f18b"
```

sda is MSDOS and sde is GPT.

Notice that for GPT, the UUID and PARTUUID look similar but you need to use the right one.

----------

## uberDoward

Yeah, it refused to work via UUID.  

I'm not using an initramfs - I thought the kernel should boot anyway.

For now, though, it's fixed.  I re-did the grub-mkconfig -o /boot/grub/grub.cfg after booting the manually altered grub to get into my Gentoo system, and all is good now.

Very odd behavior - wish I had the time to figure out what went wrong and fix it, though.

----------

## NeddySeagoon

uberDoward,

You wrote UUID, which cannot work, instead of PARTUUID.

Did you test with UUID or PARTUUID in grub.cfg?

----------

