# BUG: unable to handle kernel paging request at 00023fa8

## josephg

i have had this a few random times since i upgraded to 4.14 earlier this month.

never seen this on the 4.9 kernels i used before

```
[14701.760274] BUG: unable to handle kernel paging request at 00023fa8

[14701.760285] IP: 0xcf2b3ab1

[14701.760287] *pdpt = 000000002eaae001 *pde = 0000000000000000

[14701.760292] Oops: 0000 [#1] SMP

[14701.760295] Modules linked in: ctr ccm af_packet nf_log_ipv4 nf_log_common xt_LOG ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport xt_conntrack nf_conntrack iptable_filter zram zsmalloc ext4 crc16 mbcache jbd2 bfq arc4 snd_hda_codec_realtek i915 i2c_algo_bit snd_hda_codec_generic drm_kms_helper ath9k cfbfillrect syscopyarea cfbimgblt sysfillrect ath9k_common sysimgblt fb_sys_fops cfbcopyarea snd_hda_intel input_leds ath9k_hw evdev drm lpc_ich sdhci_pci coretemp snd_hda_codec video sdhci psmouse sr_mod hwmon mac80211 ath atkbd cfg80211 pcspkr snd_hwdep mmc_core i2c_i801 rfkill thermal snd_hda_core fan led_class cdrom mfd_core pcc_cpufreq battery ehci_pci snd_pcm button snd_timer intel_agp acpi_cpufreq intel_gtt ac snd soundcore ehci_hcd agpgart backlight usbcore usb_common

[14701.760366] CPU: 0 PID: 25226 Comm: JS Helper Tainted: G     U          4.14.65-gentoo-josephg #16

[14701.760368] Hardware name: TOSHIBA Satellite Pro A300/Portable PC, BIOS 2.20 12/07/2009

[14701.760370] task: eebf2000 task.stack: eec3a000

[14701.760373] EIP: 0xcf2b3ab1

[14701.760375] EFLAGS: 00210296 CPU: 0

[14701.760377] EAX: 00023fa4 EBX: 9c7fe000 ECX: 00000000 EDX: 017fffff

[14701.760380] ESI: 017fffff EDI: 00000000 EBP: 00023fa0 ESP: eec3bde8

[14701.760382]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068

[14701.760385] CR0: 80050033 CR2: 00023fa8 CR3: 3180c140 CR4: 000006f0

[14701.760387] Call Trace:

[14701.760390]  ? 0xcf2b3b8b

[14701.760393]  ? 0xceedbf59

[14701.760395]  ? 0xceedc4ac

[14701.760397]  ? 0xcef18a90

[14701.760399]  ? 0xcef19000

[14701.760401]  ? 0xcef0534b

[14701.760403]  ? 0xcee5c9b4

[14701.760405]  ? 0xcee5d1df

[14701.760407]  ? 0xcee5ee47

[14701.760410]  ? 0xcef06759

[14701.760412]  ? 0xcee2f23b

[14701.760414]  ? 0xcee2f4b0

[14701.760416]  ? 0xcf2c7810

[14701.760418] Code: d5 8b 74 24 14 8b 5c 24 18 85 d2 0f 84 0b ff ff ff e9 f5 fe ff ff 8d 74 26 00 55 57 56 53 83 ec 08 89 04 24 89 4c 24 04 8b 04 24 <8b> 70 04 89 f0 83 e0 03 83 f8 01 0f 85 a6 00 00 00 89 f0 83 e0

[14701.760460] EIP: 0xcf2b3ab1 SS:ESP: 0068:eec3bde8

[14701.760462] CR2: 0000000000023fa8

[14701.760465] ---[ end trace c9627314a74a4941 ]---
```

----------

## eccerr0r

Unfortunately it would be quite hard to make heads or tails of this dump.

Can you replicate the bug with a kernel that has CONFIG_KALLSYMS turned on? Otherwise you will need to run ksymoops on this data with your System.map...
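For reference, the symbol lookup that ksymoops (or CONFIG_KALLSYMS) performs boils down to finding the nearest symbol at or below each raw address. Below is a minimal sketch of that idea in Python. The two-line map excerpt and the `example_*` symbol names are hypothetical, not taken from the real System.map, and the result is only meaningful if the running kernel's addresses actually match the map (same build, no address relocation at boot). Symbols from loadable modules are not in System.map at all.

```python
import bisect


def load_system_map(text):
    """Parse System.map text into a sorted list of (address, name) pairs."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        addr, _sym_type, name = parts
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return 'name+0xoffset' for the nearest symbol at or below address."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, address) - 1
    if i < 0:
        return None  # address lies below every known symbol
    base, name = symbols[i]
    return f"{name}+{hex(address - base)}"


# Hypothetical excerpt: these names and addresses are made up for illustration.
EXAMPLE_MAP = """\
cf2b3aa0 T example_lookup
cf2b3b80 T example_lookup_slot
"""

symbols = load_system_map(EXAMPLE_MAP)
for raw in (0xCF2B3AB1, 0xCF2B3B8B):
    print(f"{raw:08x} -> {resolve(symbols, raw)}")
```

With a real map you would pass the contents of the System.map that belongs to the exact kernel build that produced the oops.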

----------

## josephg

Initially, I thought it might be caused by CONFIG_PGTABLE_MAPPING. So I unset it and recompiled this kernel. I guess that was not the cause, as I got it again.

Unfortunately, I don't know what causes this bug and when it happens. I've had it a few times, but not everyday or every time I do something. I don't know how to replicate it.

CONFIG_KALLSYMS is not turned on in this kernel. I need to recompile this kernel, after I'm able to replicate this BUG.

I don't know how to use ksymoops. But I found a redhat man page. Not sure I can get my head around it.

And I can't find ksymoops in gentoo repos. But I found this:

```
$ cat /usr/src/linux/scripts/ksymoops/README
ksymoops has been removed from the kernel.  It was always meant to be a
free standing utility, not linked to any particular kernel version.
The latest version can be found in
https://www.kernel.org/pub/linux/utils/kernel/ksymoops together with patches to
other utilities in order to give more accurate Oops debugging.

Keith Owens <kaos@ocs.com.au> Sat Jun 19 10:30:34 EST 1999
```

I see .rpm and .tar.gz packages in there. Do I just emerge that with --usepkg?

----------

## NeddySeagoon

josephg,

Try 4.18.x, see if it's fixed there. That's the gentoo testing kernel.

----------

## josephg

 *NeddySeagoon wrote:*   

> Try 4.18.x, see if it's fixed there. That's the gentoo testing kernel.

 

Hi NeddySeagoon, I could try 4.18, but I don't know how to make this BUG happen. For example, I have been using this laptop since I last had that oops and posted this topic. I have rebooted a few times since. But it hasn't happened again yet.

I don't know if this coincided with me testing nftables. I removed those modules from my kernel config, as I like to keep it clean and minimal. I don't know whether this matters either.

----------

## eccerr0r

At this point, since the problem isn't easily repeatable and ksymoops may not be readily accessible, you might as well just build a new kernel with the debug symbols built in, ready for the next event.

----------

## josephg

can i ask how/where you got your ksymoops?

should i rebuild my kernel with debug symbols for normal runtime all the time, or just for catching bugs?

this bug hasn't happened all day today either. so i'm not really sure when it will happen again. i know when it happens though, because this laptop kinda hangs for a brief while and then comes back to life again like everything's ok.

----------

## eccerr0r

I haven't used ksymoops for ages, got tired of keeping track of System.map -- all my kernels are built with the symbols -- makes oops/panics much easier to debug, and never have to deal with raw oops especially if it's not logged (e.g., "Not Syncing" or if failure happens before syslogd/journald starts).

Try https://mirrors.edge.kernel.org/pub/linux/utils/kernel/ksymoops/v2.4/

----------

## josephg

Looking back, I think I got them after I started playing with nftables. I didn't get any further kernel bugs in my dmesg after I removed all nftables from the kernel and my system. I wonder if these could be linked. Perhaps not? So I enabled the nftables modules again in my kernel. And I caught this bug again yesterday, and the following one today:

```
[Sep28 10:02] BUG: unable to handle kernel paging request at 00023fa8

[  +0.000018] IP: __radix_tree_lookup+0x11/0xe0

[  +0.000003] *pdpt = 000000002ac60001 *pde = 0000000000000000 

[  +0.000005] Oops: 0000 [#1] SMP

[  +0.000002] Modules linked in: nf_tables nfnetlink ctr ccm af_packet nf_log_ipv4 nf_log_common xt_LOG xt_pkttype xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport xt_conntrack nf_conntrack iptable_filter ip_tables x_tables zram zsmalloc ext4 crc16 mbcache jbd2 bfq arc4 ath9k ath9k_common snd_hda_codec_realtek ath9k_hw snd_hda_codec_generic i915 mac80211 snd_hda_intel ath snd_hda_codec input_leds cfg80211 psmouse atkbd sdhci_pci snd_hwdep lpc_ich snd_hda_core i2c_algo_bit sdhci pcspkr mmc_core evdev drm_kms_helper led_class snd_pcm libps2 mfd_core rfkill cfbfillrect syscopyarea cfbimgblt sysfillrect ehci_pci sysimgblt fb_sys_fops cfbcopyarea coretemp drm sr_mod snd_timer cdrom i2c_i801 hwmon snd ehci_hcd pcc_cpufreq intel_agp fan thermal i8042 intel_gtt battery soundcore agpgart acpi_cpufreq

[  +0.000068]  rtc_cmos serio usbcore usb_common button ac video backlight

[  +0.000010] CPU: 0 PID: 21126 Comm: java Tainted: G     U          4.14.65-gentoo-jgv #24

[  +0.000002] Hardware name: TOSHIBA Satellite Pro A300/Portable PC, BIOS 2.20 12/07/2009

[  +0.000002] task: ee02c000 task.stack: eae7a000

[  +0.000005] EIP: __radix_tree_lookup+0x11/0xe0

[  +0.000002] EFLAGS: 00010286 CPU: 0

[  +0.000002] EAX: 00023fa4 EBX: 9dea3000 ECX: 00000000 EDX: 017fffff

[  +0.000003] ESI: 017fffff EDI: 00000000 EBP: 00023fa0 ESP: eae7bde4

[  +0.000003]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068

[  +0.000002] CR0: 80050033 CR2: 00023fa8 CR3: 1bee60c0 CR4: 000006f0

[  +0.000002] Call Trace:

[  +0.000005]  ? radix_tree_lookup_slot+0xb/0x20

[  +0.000006]  ? find_get_entry+0x19/0xe0

[  +0.000003]  ? pagecache_get_page+0x1c/0x210

[  +0.000005]  ? lookup_swap_cache+0x30/0xf0

[  +0.000004]  ? swap_readahead_detect+0x60/0x2a0

[  +0.000003]  ? tick_program_event+0x36/0x70

[  +0.000005]  ? do_swap_page+0xbb/0x790

[  +0.000003]  ? memcg_check_events.isra.60+0x70/0x130

[  +0.000004]  ? sched_slice.isra.67+0x42/0x90

[  +0.000004]  ? mem_cgroup_commit_charge+0x62/0x3e0

[  +0.000003]  ? task_tick_fair+0x45f/0x680

[  +0.000003]  ? page_add_new_anon_rmap+0x5d/0xa0

[  +0.000003]  ? handle_mm_fault+0x669/0xf00

[  +0.000005]  ? __do_page_fault+0x19b/0x400

[  +0.000002]  ? vmalloc_sync_all+0x10/0x10

[  +0.000004]  ? common_exception+0x52/0x5a

[  +0.000002] Code: d5 8b 74 24 14 8b 5c 24 18 85 d2 0f 84 0b ff ff ff e9 f5 fe ff ff 8d 74 26 00 55 57 56 53 83 ec 08 89 04 24 89 4c 24 04 8b 04 24 <8b> 70 04 89 f0 83 e0 03 83 f8 01 0f 85 a6 00 00 00 89 f0 83 e0

[  +0.000043] EIP: __radix_tree_lookup+0x11/0xe0 SS:ESP: 0068:eae7bde4

[  +0.000002] CR2: 0000000000023fa8

[  +0.000004] ---[ end trace 5fe51ac9167d38b1 ]---
```

I enabled CONFIG_KALLSYMS when I recompiled my kernel. And I don't have swap.

Interestingly, my system hasn't gone kaput and is happily chugging away.

----------

## eccerr0r

Interesting.  This would indicate some problem with the memory manager / control groups and has no hints of network related issues.  Whether it's a side effect of what you're using in your kernel has yet to be determined.  The fact it's repeatable indicates your machine is building the kernel the same way.
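One way to cross-check where the bad address comes from is to decode the faulting instruction by hand. In the `Code:` line the byte in angle brackets starts the instruction that faulted; `8b 70 04` decodes, by my reading of the x86 encoding, as `mov 0x4(%eax),%esi`, so the faulting address should be EAX + 4. The sketch below does that arithmetic on an excerpt of the trace above (timestamps stripped, `Code:` line shortened). The decoder is deliberately narrow and handles only this one addressing form; it is an illustration, not a disassembler.

```python
import re

# Excerpt of the oops above, timestamps removed and Code: line shortened.
OOPS = """\
EAX: 00023fa4 EBX: 9dea3000 ECX: 00000000 EDX: 017fffff
ESI: 017fffff EDI: 00000000 EBP: 00023fa0 ESP: eae7bde4
CR0: 80050033 CR2: 00023fa8 CR3: 1bee60c0 CR4: 000006f0
Code: 89 04 24 89 4c 24 04 8b 04 24 <8b> 70 04 89 f0 83 e0 03
"""

# x86 32-bit register numbering as used in the ModRM byte.
REG_NAMES = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]


def parse_registers(text):
    """Collect 'NAME: xxxxxxxx' pairs (EAX, CR2, ...) into a dict of ints."""
    pairs = re.findall(r"\b([A-Z0-9]{3}): ([0-9a-f]{8})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def faulting_bytes(text):
    """Return the code bytes starting at the one marked <xx>."""
    code = re.search(r"Code: (.*)", text).group(1).split()
    start = next(i for i, b in enumerate(code) if b.startswith("<"))
    return [int(b.strip("<>"), 16) for b in code[start:]]


def decode_mov_load(insn):
    """Decode only opcode 8b with mod=01: register base plus 8-bit displacement."""
    opcode, modrm, disp8 = insn[0], insn[1], insn[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0x8B or mod != 1 or rm == 4:
        raise ValueError("outside the narrow case this sketch handles")
    disp = disp8 - 256 if disp8 >= 128 else disp8  # sign-extend disp8
    return REG_NAMES[reg], REG_NAMES[rm], disp


regs = parse_registers(OOPS)
dest, base, disp = decode_mov_load(faulting_bytes(OOPS))
fault_addr = (regs[base] + disp) & 0xFFFFFFFF
print(f"mov {disp:#x}(%{base.lower()}),%{dest.lower()}")
print(f"computed {fault_addr:08x}, CR2 {regs['CR2']:08x}")
```

The computed address matches CR2, which says the load went through a small, bogus pointer held in EAX. It does not say where that pointer came from.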

Now I'd agree with Neddy and suggest trying a newer kernel; it might be a bug in the particular kernel you're using that's being triggered by testing with nftables.

I wonder if it has something to do with the backpatch of PTI... :-(  I have not been able to trigger this oops yet though I see some fairly similar oops on the goog.

----------

## josephg

 *eccerr0r wrote:*   

> Interesting.  This would indicate some problem with the memory manager / control groups and has no hints of network related issues.  Whether it's a side effect of what you're using in your kernel has yet to be determined.

 

Shall I enable/disable either or both?

```
CONFIG_MEMCG=y
# CONFIG_MEMCG_SWAP is not set
```
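When toggling options between rebuilds it helps to confirm what a given `.config` actually says, including the `# ... is not set` form. Here is a small sketch that parses config text and reports the state of a few options; the `parse_kconfig` and `describe` helpers are names made up for this example.

```python
import re


def parse_kconfig(text):
    """Map option name -> value ('y', 'm', ...) or None for '# ... is not set'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        unset = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if unset:
            options[unset.group(1)] = None
            continue
        assigned = re.fullmatch(r"(CONFIG_\w+)=(.*)", line)
        if assigned:
            options[assigned.group(1)] = assigned.group(2)
    return options


def describe(options, name):
    """Human-readable state of one option."""
    if name not in options:
        return "absent from this excerpt"
    return "not set" if options[name] is None else options[name]


# The two lines quoted in the post above.
CONFIG_TEXT = """\
CONFIG_MEMCG=y
# CONFIG_MEMCG_SWAP is not set
"""

options = parse_kconfig(CONFIG_TEXT)
for name in ("CONFIG_MEMCG", "CONFIG_MEMCG_SWAP", "CONFIG_KALLSYMS"):
    print(name, "->", describe(options, name))
```

To check a whole kernel, feed it the contents of the `.config` in the source tree. If the running kernel was built with CONFIG_IKCONFIG_PROC, `/proc/config.gz` (read with `gzip.open`) gives the configuration of the kernel that is actually booted.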

 *eccerr0r wrote:*   

> The fact it's repeatable indicates your machine is building the kernel the same way.

 

It keeps happening, but I can't tie it to anything in particular that I might be doing.

 *eccerr0r wrote:*   

> indicates your machine is building the kernel the same way.

 

Hmm I am using ccache.. suspect? On a slow machine, it is a boon. I'll try rebuilding this kernel without ccache, and see if it makes a difference.

 *eccerr0r wrote:*   

> Now I'd agree with Neddy and suggest trying a newer kernel, might be a bug in the particular kernel you're using that's being triggered by testing with nftables.

 

I'm on btrfs, and suffered major filesystem corruption when I moved up from 4.9 and briefly tried 4.12, back when it was labelled stable (it was labelled unstable shortly thereafter). I had debian, ubuntu, arch, slackware, and a few other distros on different subvolumes and I lost all of them. I wasn't too bothered as I was primarily using gentoo by then. I was able to recover most of my gentoo files (not the filesystem) and could resurrect it on a fresh new filesystem. And I lost most of my hair at that time. Now I'm reluctant to jump kernels without evaluating the kernel changelog for btrfs. It seems there are major changes from 4.14 to 4.18, and if I jump that far ahead I might not be able to backtrack to 4.9, currently my most stable kernel.

Yes, I have 4.9 as my rock-stable backup kernel, with no issues at all. I started using 4.14 recently, after 4.18 was announced as the next LTS. If I continue to have problems with 4.14, I'd move back to 4.9 rather than jump forward to 4.18. This is just from having been burnt many times over on the bleeding edge. I wouldn't mind if this were a test system, but I use this one as my daily driver.

 *eccerr0r wrote:*   

> I wonder if it has something to do with the backpatch of PTI...   I have not been able to trigger this oops yet though I see some fairly similar oops on the goog.

 

Is there something I could try to trigger this oops? I got another one last night, but none yet all day today. But then, I haven't started nftables yet.

----------

## eccerr0r

You can't disable the memory manager, it's a fundamental part of the OS.  Disabling control groups will break new init software, so that would cause problems too.

User mode software should NOT trigger oops, so that shouldn't be an issue to use ccache unless it was causing it to build wrong.  Worth trying if you think a build problem caused the bad code sequence.

You need to have backups in any case.  I'd suggest copying your storage to another unit and trying it with 4.18.  I figure someday the upgrade will be forced.

BTW, how close are you to running out of RAM?  In terms of this question, include cache and buffer memory as "used" memory (report "free" as-is without subtracting out anything...) 

As I have yet to hit this bug myself I don't know how to trigger it...

----------

## josephg

 *eccerr0r wrote:*   

> You can't disable the memory manager, it's a fundamental part of the OS.  Disabling control groups will break new init software, so that would cause problems too.

 

So I enabled these. But I don't have that new init software.

```
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_MEMCG_SWAP_ENABLED=y
```

 *eccerr0r wrote:*   

> BTW, how close are you to running out of RAM?  In terms of this question, include cache and buffer memory as "used" memory (report "free" as-is without subtracting out anything...)

 

```
$ free
              total        used        free      shared  buff/cache   available
Mem:        4014136      565256     2840368       69228      608512     3030272
Swap:             0           0           0
```
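Counting cache and buffers as "used", as asked above, amounts to `total - free`. A quick sketch of that arithmetic on the output just shown (the `parse_free` helper is written for this example; `free` reports kibibytes by default):

```python
# Output of `free` as posted above.
FREE_OUTPUT = """\
              total        used        free      shared  buff/cache   available
Mem:        4014136      565256     2840368       69228      608512     3030272
Swap:             0           0           0
"""


def parse_free(text):
    """Turn `free` output into {row label: {column name: KiB}}."""
    lines = text.strip().splitlines()
    headers = lines[0].split()
    rows = {}
    for line in lines[1:]:
        label, *values = line.split()
        rows[label.rstrip(":")] = dict(zip(headers, map(int, values)))
    return rows


mem = parse_free(FREE_OUTPUT)["Mem"]
used_incl_cache = mem["total"] - mem["free"]  # cache and buffers count as used
pct = 100 * used_incl_cache / mem["total"]
print(f"{used_incl_cache} KiB of {mem['total']} KiB in use incl. cache ({pct:.1f}%)")
```

By that measure the machine was at roughly 29% of RAM at the moment this snapshot was taken, which says nothing about usage at the time of an oops.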

----------

## eccerr0r

Then you may be okay to disable... thought that both openrc and systemd used cgroups to control resources during startup, maybe not.

It also will affect containers and if you're using it for portage, apparently.  But it's not been needed (with consequences thereof) for years.

Well, the thing about the memory...how much memory is it using when the oops occurs, perhaps that is harder to gauge.  Perhaps filling cache/buffers up can trigger the issue faster...

----------

## josephg

 *eccerr0r wrote:*   

> Then you may be okay to disable... thought that both openrc and systemd used cgroups to control resources during startup, maybe not.
> 
> It also will affect containers and if you're using it for portage, apparently.  But it's not been needed (with consequences thereof) for years.

 

ok, i've been testing with MEMCG on and off, and i think this problem happens more frequently with it on. i ran without memcg for a few days, and no problems. today i turned it on, and kernel oops.

```
[ 1184.956815] BUG: unable to handle kernel paging request at 00023fa8

[ 1184.956832] IP: __radix_tree_lookup+0x11/0xe0

[ 1184.956835] *pdpt = 0000000028796001 *pde = 0000000000000000 

[ 1184.956842] Oops: 0000 [#1] SMP

[ 1184.956844] Modules linked in: ctr ccm af_packet nf_log_ipv4 nf_log_common nft_log nf_conntrack_ipv4 nf_defrag_ipv4 nft_ct nf_conntrack nft_counter nft_meta nft_set_bitmap nft_set_hash nft_set_rbtree nf_tables_ipv4 nf_tables nfnetlink lzo zram zsmalloc ext4 crc16 mbcache jbd2 bfq arc4 i915 snd_hda_codec_realtek i2c_algo_bit drm_kms_helper snd_hda_codec_generic ath9k cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea ath9k_common drm ath9k_hw snd_hda_intel lpc_ich fb font fbdev snd_hda_codec snd_hwdep mac80211 snd_hda_core snd_pcm snd_timer ath sdhci_pci snd input_leds psmouse sdhci cfg80211 evdev mfd_core mmc_core atkbd led_class sr_mod libps2 rfkill thermal fan i2c_i801 cdrom soundcore pcspkr battery video ehci_pci pcc_cpufreq i8042 rtc_cmos backlight serio ehci_hcd

[ 1184.956927]  intel_agp button ac usbcore intel_gtt acpi_cpufreq usb_common agpgart coretemp hwmon

[ 1184.956941] CPU: 0 PID: 15790 Comm: Compositor Tainted: G     U          4.14.65-gentoo-jgv #38

[ 1184.956944] Hardware name: TOSHIBA Satellite Pro A300/Portable PC, BIOS 2.20 12/07/2009

[ 1184.956947] task: e87c7000 task.stack: e87c2000

[ 1184.956951] EIP: __radix_tree_lookup+0x11/0xe0

[ 1184.956954] EFLAGS: 00210292 CPU: 0

[ 1184.956957] EAX: 00023fa4 EBX: 8d2ac000 ECX: 00000000 EDX: 017fffff

[ 1184.956960] ESI: 017fffff EDI: 00000000 EBP: 00023fa0 ESP: e87c3dec

[ 1184.956963]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068

[ 1184.956967] CR0: 80050033 CR2: 00023fa8 CR3: 29980760 CR4: 000006f0

[ 1184.956969] Call Trace:

[ 1184.956974]  ? radix_tree_lookup_slot+0xb/0x20

[ 1184.956979]  ? find_get_entry+0x19/0xe0

[ 1184.956983]  ? pagecache_get_page+0x1c/0x210

[ 1184.956988]  ? lookup_swap_cache+0x30/0xf0

[ 1184.956992]  ? swap_readahead_detect+0x60/0x2a0

[ 1184.956997]  ? do_swap_page+0xbb/0x790

[ 1184.957002]  ? release_pages+0x249/0x2c0

[ 1184.957006]  ? __pagevec_lru_add_fn+0xd5/0x180

[ 1184.957011]  ? mem_cgroup_commit_charge+0x62/0x3e0

[ 1184.957011]  ? page_add_new_anon_rmap+0x5d/0xa0

[ 1184.957011]  ? handle_mm_fault+0x679/0xeb0

[ 1184.957011]  ? __do_page_fault+0x19b/0x400

[ 1184.957011]  ? vmalloc_sync_all+0x10/0x10

[ 1184.957011]  ? common_exception+0x52/0x5a

[ 1184.957011] Code: d5 8b 74 24 14 8b 5c 24 18 85 d2 0f 84 0b ff ff ff e9 f5 fe ff ff 8d 74 26 00 55 57 56 53 83 ec 08 89 04 24 89 4c 24 04 8b 04 24 <8b> 70 04 89 f0 83 e0 03 83 f8 01 0f 85 a6 00 00 00 89 f0 83 e0

[ 1184.957011] EIP: __radix_tree_lookup+0x11/0xe0 SS:ESP: 0068:e87c3dec

[ 1184.957011] CR2: 0000000000023fa8

[ 1184.957011] ---[ end trace f0e3fe5210dfa56c ]---
```

i think i can safely rule out nftables causing this, as i've been on nftables for the past few days without any kernel bugs popping up. my testing might not be very scientific, and i still can't make it happen.

 *eccerr0r wrote:*   

> Well, the thing about the memory...how much memory is it using when the oops occurs, perhaps that is harder to gauge.  Perhaps filling cache/buffers up can trigger the issue faster...

 

well, i have no idea when it's gonna oops again. i ran dstat for a while, but it never happened at that time. i can run free after i notice it, but by the time i look at it, there's always loadsa memory free. i don't know how to catch it while it oopses.
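One possible way to catch the memory state closer to the event is a small watcher that polls the kernel log and saves `/proc/meminfo` as soon as a new oops line appears. The sketch below assumes the user running it is allowed to run `dmesg`; the marker strings, the polling interval and the `oops-meminfo.log` file name are choices made for this example. The snapshot is still taken up to one polling interval after the oops, so it is an approximation, and on the first pass it will also record any oops already sitting in the log buffer.

```python
import subprocess
import time

# Substrings that identify the oops lines seen in this thread.
OOPS_MARKERS = ("BUG: unable to handle", "Oops:")


def find_oops_lines(log_text, already_seen):
    """Return log lines containing an oops marker that were not seen before."""
    fresh = []
    for line in log_text.splitlines():
        if line not in already_seen and any(m in line for m in OOPS_MARKERS):
            fresh.append(line)
            already_seen.add(line)
    return fresh


def snapshot_meminfo(path="/proc/meminfo"):
    """Read the current memory statistics as text."""
    with open(path) as f:
        return f.read()


def watch(interval_seconds=5, out_path="oops-meminfo.log"):
    """Poll dmesg forever; append a meminfo snapshot for each new oops line."""
    seen = set()
    while True:
        log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
        for line in find_oops_lines(log, seen):
            with open(out_path, "a") as out:
                out.write(f"{time.ctime()} {line}\n{snapshot_meminfo()}\n")
        time.sleep(interval_seconds)


# To use it, call watch() from a terminal and leave it running:
# watch()
```

Leaving this running in a spare terminal costs little, and after the next hang the log file would show how full memory was within a few seconds of the oops.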

----------

## josephg

btw i have had none of these problems on the 4.9 series. i've gone back to my backup kernel 4.9.122 and no issue. i've thrown everything at it, including nftables, android-studio, sdk, etc. and all together too. no hiccups. it had iptables which i've removed now. maybe i'll stay with 4.9 a bit longer. it's not like i have the latest hardware, and since last month i feel like i have been road testing bleeding-edge and troubleshooting all the time. i actually wondered if something was wrong with my hardware.

same config for 4.9 and 4.14 kernels from gentoo-sources. no extra use-flags either.

```
$ eix gentoo-sources
[I] sys-kernel/gentoo-sources
     Available versions:  
     (4.4.150) 4.4.150^bs
     (4.4.157) ~4.4.157^bs
     (4.4.159) ~4.4.159^bs
     (4.9.49-r1) 4.9.49-r1^bs
     (4.9.118) 4.9.118^bs
     (4.9.119) ~4.9.119^bs
     (4.9.120) 4.9.120^bs
     (4.9.120-r1) 4.9.120-r1^bs
     (4.9.122) 4.9.122^bs
     (4.9.128) ~4.9.128^bs
     (4.9.129) ~4.9.129^bs
     (4.9.130) ~4.9.130^bs
     (4.9.131) ~4.9.131^bs
     (4.14.52) *4.14.52^bs
     (4.14.65) 4.14.65^bs
     (4.14.71) ~4.14.71^bs
     (4.14.72) ~4.14.72^bs
     (4.14.73) ~4.14.73^bs
     (4.14.74) ~4.14.74^bs
     (4.18.9) ~4.18.9^bs
     (4.18.10) ~4.18.10^bs
     (4.18.11) ~4.18.11^bs
     (4.18.12) ~4.18.12^bs
       {build experimental symlink}
     Installed versions:  4.9.122(4.9.122)^bs(22:47:33 30/08/18)(-build -experimental -symlink) 4.14.65(4.14.65)^bs(00:21:30 31/08/18)(-build -experimental -symlink)
     Homepage:            https://dev.gentoo.org/~mpagano/genpatches
     Description:         Full sources including the Gentoo patchset for the 4.18 kernel tree
```

----------

