# kernel 5.15.x breaks root on DHCP+NFS

## mortonP

Hi...

Every year's end I make the jump from old LTS to new LTS kernel, meaning this year from 5.10.x to 5.15.x

I boot Gentoo via a kernel that mounts its root fs from NFS:

CONFIG_CMDLINE="ip=dhcp root=/dev/nfs nfsroot=192.168.x.x:/gentoo,tcp,vers=4.1 ...."

and so far everything has worked fine.

With 5.15, however, this no longer works: the kernel hangs at boot and after some time emits

"VFS: Unable to mount root fs via NFS"

I have now spent half a day trying every kernel from 5.10 on: 5.11, 5.12, 5.13, up to 5.14.21. They all work fine.

5.15.x fails; I tried various combinations of the "new" kernel config options without success.

5.16-rc7 fails, too

Looking at the DHCP server log I don't see the DHCP query from the booting kernel before mounting the NFS root, which would explain the hang - there's no network coming up.

So I suspect that the kernel's DHCP IP autoconfiguration is failing, rather than the NFS mount.

Still, I see 5.15 brought "exciting NFS changes", maybe these NFS core changes broke something?

Surely someone would have noticed this failing since 5.15 has been out already for a while?

I'm running out of ideas how to debug this further and get 5.15 running....

...do you know of something to google for or try?

Thank you!

----------

## alamahant

Do you have

```
CONFIG_CMDLINE_BOOL=y
```

?

----------

## mortonP

 *alamahant wrote:*   

> Do you have
> 
> ```
> CONFIG_CMDLINE_BOOL=y
> ```
> ...

 

Yes.

----------

## NeddySeagoon

mortonP,

Pastebin your 5.15 kernel .config file please.

Check your dhcp server log for signs that an IP was requested and offered.

----------

## Hu

Can you drop into an initramfs rescue shell, and look around to determine what is and is not working?  You wrote at the beginning that you don't see the DHCP query in the DHCP server log.  Can you collect a network packet capture, to confirm that the query was never even sent to the system running the DHCP server?

----------

## Anon-E-moose

There's a good chance that an option for NFS-related stuff has either changed or been added. I'd check that whole subsystem rather than use the defaults from 5.10.

----------

## mortonP

I figured it out, basically by brute-force bisecting the .config changes between 5.14 and 5.15; for many of the options I don't really understand what they do.

5.14:

│ Symbol: E1000E [=y]

│ Type  : tristate

│ Defined at drivers/net/ethernet/intel/Kconfig:58

│   Prompt: Intel(R) PRO/1000 PCI-Express Gigabit Ethernet support

│   Depends on: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_INTEL [=y] && PCI [=y] && (!SPARC32 || BROKEN [=n])

5.15:

│ Symbol: E1000E [=y]

│ Type  : tristate

│ Defined at drivers/net/ethernet/intel/Kconfig:58

│   Prompt: Intel(R) PRO/1000 PCI-Express Gigabit Ethernet support

│   Depends on: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_INTEL [=y] && PCI [=y] && (!SPARC32 || BROKEN [=n]) && PTP_1588_CLOCK_OPTIONAL [=y]

There is no initramfs.

The kernel image itself runs DHCP, mounts the rootfs via NFS, and does a normal boot as if from a local disk.

This is only possible if all necessary drivers are compiled into the kernel image, including the network device drivers.

Going from 5.14 to 5.15, the driver for Intel NICs gets an additional dependency, && PTP_1588_CLOCK_OPTIONAL, which is not =y by default.

So e1000e automatically becomes =m as well (a tristate symbol cannot be =y while one of its dependencies is only =m), and the kernel image loses networking....

...oops

I don't know how to feel about this now, having spent 2 days debugging it.

But I learned something again, and I hope you did too.

Sorry for bothering - in retrospect the symptoms and the cause absolutely make sense...
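
For anyone else booting without an initramfs, the failure mode above (a required driver silently ending up as =m) can be caught before rebooting. The following is only a minimal sketch: the function name `check_builtin` is made up for illustration, and apart from E1000E the symbols in the usage comment (IP_PNP_DHCP, NFS_FS, ROOT_NFS) are my assumption of what such a setup needs, not something taken from this thread.

```shell
#!/bin/sh
# Hypothetical helper: report every listed Kconfig symbol that is not
# built in (=y) in the given .config file.
# Usage: check_builtin <config-file> <SYMBOL>...
check_builtin() {
    config=$1
    shift
    status=0
    for sym in "$@"; do
        if ! grep -q "^CONFIG_${sym}=y\$" "$config"; then
            # Show what the config actually says (=m, "is not set"), if anything
            actual=$(grep -E "^(# )?CONFIG_${sym}[= ]" "$config" | head -n 1)
            echo "NOT built in: ${sym} (${actual:-missing})"
            status=1
        fi
    done
    return $status
}

# Example call (path and symbol list are assumptions, adjust to your setup):
#   check_builtin /usr/src/linux/.config E1000E IP_PNP_DHCP NFS_FS ROOT_NFS
```

The function returns non-zero as soon as one symbol is not =y, so it can gate a build script.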

----------

## Hu

Perhaps there should be an initramfs, so you can drop in and look around when things don't work. ;)

Do you even need this kernel to have CONFIG_MODULES=y?  If not, consider disabling module support, which might encourage oldconfig to behave better when next this kind of thing happens.  For a kernel booted over the network, I would think that having all kernel functionality built in is a net win, unless you routinely don't use significant amounts of the kernel, but want them available as modules for those rare days you use them.

How did your kernel end up with PTP_1588_CLOCK_OPTIONAL not set to =y?  As I read the Kconfig language, it should have been =y, unless you had made PTP support a module:

```
config PTP_1588_CLOCK
    ...
    depends on NET && POSIX_TIMERS
    default ETHERNET

config PTP_1588_CLOCK_OPTIONAL
    tristate
    default y if PTP_1588_CLOCK=n
    default PTP_1588_CLOCK
```

----------

## grknight

 *mortonP wrote:*   

> I figured it out, basically by brute-force bisecting the .config changes between 5.14 and 5.15; for many of the options I don't really understand what they do.

 

It is never a bad idea to run the old and new configs through the /usr/src/linux/scripts/diffconfig tool to see what has changed. This is especially worthwhile between major.minor releases, just in case.
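
As a rough illustration of how diffconfig could have pointed straight at the problem in this thread: the sketch below filters its output for symbols that went from built-in to module. It assumes that diffconfig prints changed symbols as ` SYMBOL old -> new` (with the CONFIG_ prefix stripped); the function name `demoted_to_module` and the config file names in the comment are made up.

```shell
#!/bin/sh
# Hypothetical filter: read diffconfig output on stdin and print every
# symbol whose value changed from y (built in) to m (module).
# Assumes changed lines look like " E1000E y -> m".
demoted_to_module() {
    awk '$2 == "y" && $3 == "->" && $4 == "m" { print $1 }'
}

# Example call (file names are placeholders):
#   /usr/src/linux/scripts/diffconfig config-5.14 config-5.15 | demoted_to_module
```

On a kernel that must boot without an initramfs, anything this prints deserves a second look.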

----------

## mortonP

 *Hu wrote:*   

> Perhaps there should be an initramfs, so you can drop in and look around when things don't work. ;)
> 
> How did your kernel end up with PTP_1588_CLOCK_OPTIONAL not set to =y?

 

I haven't used an initramfs for years... One kernel image file is enough to keep track of? :-)

I redid the 5.10 -> 5.15 .config upgrade and ended up again with PTP_1588 as a module - either I'm too stupid or there is another dependency somewhere....

----------

## mortonP

 *grknight wrote:*   

> It is never a bad idea to run the old and new configs through the /usr/src/linux/scripts/diffconfig tool to see what has changed.

 

Ooooh... that's a nice tool, didn't know about that yet. Thank you! :-)

----------

## mortonP

I have now also upgraded the NFS server (also Gentoo) from 5.10 to 5.15, and client-side early boot now fails with

mount: /foobar... : mount(2) system call failed: Object is remote.

*sigh* Something changed server side, too...

Edit:

The client-side mount error occurs when the NFS directory exported on the server side contains a bind mount. So far this was not a problem; with 5.15 it seemingly is.

The hang on boot client-side is actually a loooong delay, waiting a minute for "random fast init done". Missing entropy seems to be a common problem on clients without a keyboard.

----------

## toralf

 *mortonP wrote:*   

> Hi...
> 
> Every year's end I make the jump from old LTS to new LTS kernel, meaning this year from 5.10.x to 5.15.x

It will probably become an LTS, but as of today that has not been officially announced.

----------

## mortonP

 *toralf wrote:*   

> *mortonP wrote:*
> 
> > Hi...
> > 
> > Every year's end I make the jump from old LTS to new LTS kernel, meaning this year from 5.10.x to 5.15.x
> 
> It will probably become an LTS, but as of today that has not been officially announced.

 

According to https://www.kernel.org/category/releases.html it is an LTS.

----------

## eccerr0r

I was about to make a new post on this but now I think it would have been a dupe...

I was trying to build a fresh PXE boot system.  Got the client machine to pull up a kernel just fine, but it fails to find init.

I see the request on the NFS server for rpc.mountd so DHCP and the mount request went out, but basically it sits there after the kernel dhcp client picks up data from the DHCP server...

It sits there for almost 100 seconds before it times out complaining about not finding init.

I suspect I'll have to try something other than a 5.15 kernel to see if my settings are correct or not.  Very weird.  Also I'll have to redo my initramfs as it does not do nfs at all, so no debug through that (though after dropping into the shell on the initramfs, it clearly can ping on the network, etc., so appears to be an NFS mounting issue at this point.)

---

Hmmm, never mind. I might have a different problem here after all; I just got excited by the delay I saw. ip=dhcp should have dumped the DHCP data to the kernel output, which it does for me, but it hangs for about 100 seconds just after that and then claims it can't find /sbin/init. Fortunately or unfortunately, I see the same results on qemu as I do on physical hardware...

----------

## gjaekel

I ran into a comparable problem today while switching from kernel 5.10 to 5.15 on a Cisco blade center. Here, the blade also boots via PXE: the kernel and initrd are pulled via BOOTP, and the IP configuration comes from DHCP.

The boot process fails "inside" the initrd while attempting to switch to the new NFS root. It turns out that the rootpath was empty. Unfortunately, the init script does not detect and handle this as an error, so strange things happen, resulting in a kernel panic. By using the kernel command line parameter

```
rdinit=/bin/sh
```

I was able to interrupt the boot process to have a look at the kernel messages.

The blade is configured to provide two NICs (eth0 and eth1). It turns out that eth1 "now" becomes ready before eth0. And by accident, it got an answer there from a foreign DHCP server whose DHCP pool offers no options such as the rootpath.

As I said, this happens by accident and is the result of someone else's misconfiguration. But it seems never to happen when booting the 5.10 kernel, and always with the 5.15 kernel. Please note that rebooting the blade is very tedious because the BIOS hardware test takes more than 2 minutes, so "never" and "always" should be read as 3 out of 3 times. It therefore seems to be related to some "minor" change: with the newer kernel, eth1 also becomes "ready and up" (or does so before eth0), and the DHCP client picks up the offer on that network.

I was able to solve my issue by using 

```
ip=:::::eth0:dhcp
```

instead of the previously used, simple

```
ip=dhcp
```

on the kernel command line provided by the DHCP server.
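
For reference, the colon-separated fields of `ip=` are, as far as I read the kernel's nfsroot documentation, `client-ip:server-ip:gw-ip:netmask:hostname:device:autoconf`, so the five leading colons above simply leave everything before the device empty. A small sketch that builds such a parameter (the function name `ip_param` is made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: build an ip= kernel parameter that pins IP
# autoconfiguration to a single interface.
# Assumed field order (kernel nfsroot documentation):
#   client-ip:server-ip:gw-ip:netmask:hostname:device:autoconf
ip_param() {
    device=$1
    autoconf=${2:-dhcp}   # defaults to dhcp when no method is given
    # The first five fields stay empty so that autoconfiguration fills them in
    printf 'ip=:::::%s:%s\n' "$device" "$autoconf"
}

ip_param eth0   # prints ip=:::::eth0:dhcp
```

Naming the device explicitly avoids depending on which NIC happens to come up first.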

----------

