# [Solved] >=wpa_supplicant-2.10 panics my kernels

## sublogic

EDIT: deagol gave a patch in post 8701895 that will probably make it into the kernel tree.  Marking as solved.

I've been affected by the recent wpa-supplicant issue as well (thanks jburns for the tip about tkip).  But on my old, old laptop it's worse.  How old ?

```
$ egrep 'vendor_id|model name' /proc/cpuinfo | sort -u

model name   : Intel(R) Core(TM) Duo CPU      T2250  @ 1.73GHz

vendor_id   : GenuineIntel

$ free -m

               total        used        free      shared  buff/cache   available

Mem:             930         206         268           6         455         461

Swap:           2047          84        1963
```

How much worse ?

wpa_supplicant-2.9-r8: works.

wpa_supplicant-2.10: kernel panic.

wpa_supplicant-2.10-r1: no panic, but no wireless either, stuck in scanning according to wpa_cli.

wpa_supplicant-2.10-r1 USE=tkip: panic.

I keep a binary package of 2.9-r8 but that's not a permanent solution.

Here's a png of the screen after the panic.  It looks like a bug in the rtl818x_pci module, triggered by the newer wpa_supplicant.  In an interrupt, no less.  Chasing that is going to be fun.

I'm having no luck capturing a crash dump with kexec -p but that will be for another thread.  I'll post more actionable info there.

Has anyone else been hit this hard?  Just curious.

Last edited by sublogic on Thu Apr 21, 2022 11:16 pm; edited 2 times in total

----------

## sublogic

Progress.  I got a crash dump.  Here's the dmesg with the panic, starting at the point where I started net.wlp8s9.  Now to teach myself kernel debugging.

```
# vmcore-dmesg /var/crash/vmcore | tail -n 56

[  140.809507] wlp8s9: authenticate with 2c:99:24:32:61:c9

[  140.986796] wlp8s9: send auth to 2c:99:24:32:61:c9 (try 1/3)

[  141.193344] wlp8s9: send auth to 2c:99:24:32:61:c9 (try 2/3)

[  141.195793] wlp8s9: authenticated

[  141.196656] wlp8s9: associate with 2c:99:24:32:61:c9 (try 1/3)

[  141.200157] wlp8s9: RX AssocResp from 2c:99:24:32:61:c9 (capab=0x431 status=0 aid=4)

[  141.200233] wlp8s9: associated

[  141.215267] divide error: 0000 [#1] SMP

[  141.215354] CPU: 1 PID: 3988 Comm: wpa_supplicant Kdump: loaded Not tainted 5.15.26-gentoo-x86 #1

[  141.215439] Hardware name: Gateway MX         /, BIOS 83.08 03/06/07

[  141.215508] EIP: rtl8180_tx+0x1c1/0x530 [rtl818x_pci]

[  141.215588] Code: 16 83 e0 0f 66 89 46 16 66 0b 87 9e 05 00 00 66 89 46 16 8b 75 f0 31 d2 c1 e6 05 8d 0c 37 8b 81 bc 00 00 00 03 81 ac 00 00 00 <f7> b1 b0 00 00 00 c1 e2 05 03 91 a4 00 00 00 83 bf 84 05 00 00 02

[  141.215697] EAX: 00000000 EBX: c5313180 ECX: c32098a0 EDX: 00000000

[  141.215767] ESI: 00000040 EDI: c3209860 EBP: c2a81a8c ESP: c2a81a5c

[  141.215837] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210046

[  141.215910] CR0: 80050033 CR2: b7d46560 CR3: 02a7f000 CR4: 000006d0

[  141.215982] Call Trace:

[  141.216050]  ? rtl8180_interrupt+0x90/0x90 [rtl818x_pci]

[  141.216127]  ieee80211_tx_frags+0x131/0x1f0 [mac80211]

[  141.216434]  __ieee80211_tx+0x64/0x130 [mac80211]

[  141.216690]  ieee80211_tx+0xb2/0x100 [mac80211]

[  141.216949]  ieee80211_xmit+0xa9/0xe0 [mac80211]

[  141.217206]  __ieee80211_subif_start_xmit+0x946/0xbb0 [mac80211]

[  141.217471]  ieee80211_tx_control_port+0x16d/0x1c0 [mac80211]

[  141.217731]  ? __ieee80211_tx_skb_tid_band+0x80/0x80 [mac80211]

[  141.217833]  nl80211_tx_control_port+0x15c/0x2c0 [cfg80211]

[  141.217833]  genl_rcv_msg+0x135/0x320

[  141.217833]  ? __slab_free+0x99/0x280

[  141.217833]  ? nl80211_msg_put_channel.part.0+0x5c0/0x5c0 [cfg80211]

[  141.217833]  ? genl_get_cmd+0xf0/0xf0

[  141.217833]  netlink_rcv_skb+0x33/0xc0

[  141.217833]  genl_rcv+0x21/0x30

[  141.217833]  netlink_unicast+0x1b5/0x2c0

[  141.217833]  netlink_sendmsg+0x263/0x450

[  141.217833]  ? netlink_unicast+0x2c0/0x2c0

[  141.217833]  sock_sendmsg+0x5c/0x60

[  141.217833]  ____sys_sendmsg+0x1a2/0x1f0

[  141.217833]  ? import_iovec+0x13/0x20

[  141.217833]  ___sys_sendmsg+0x8d/0xb0

[  141.217833]  ? __mod_memcg_lruvec_state+0x34/0x70

[  141.217833]  ? _copy_to_user+0x17/0x30

[  141.217833]  ? unlock_page_memcg+0x53/0xd0

[  141.217833]  ? page_add_file_rmap+0x98/0x1c0

[  141.217833]  ? do_set_pte+0xab/0x150

[  141.217833]  __sys_sendmsg+0x32/0x70

[  141.217833]  __ia32_sys_socketcall+0x27a/0x320

[  141.217833]  __do_fast_syscall_32+0x4c/0xc0

[  141.217833]  do_fast_syscall_32+0x29/0x60

[  141.217833]  do_SYSENTER_32+0x15/0x20

[  141.217833]  entry_SYSENTER_32+0x98/0xe7

[  141.217833] EIP: 0xb7f1f549

[  141.217833] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76

[  141.217833] EAX: ffffffda EBX: 00000010 ECX: bf8061e0 EDX: 00000000

[  141.217833] ESI: b7b31000 EDI: 01dfe2d0 EBP: 01dfe250 ESP: bf8061d0

[  141.217833] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200286

[  141.217833] Modules linked in: lz4 lz4_compress sunrpc rtl818x_pci mac80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic i2c_algo_bit ledtrig_audio drm_ttm_helper snd_hda_intel ttm snd_intel_dspcfg snd_intel_sdw_acpi drm_kms_helper snd_hda_codec cfg80211 cec snd_hda_core rc_core drm snd_hwdep snd_pcm snd_timer eeprom_93cx6 snd sr_mod rfkill fb_sys_fops libarc4 syscopyarea soundcore coretemp sysfillrect joydev cdrom pcspkr serio_raw sky2 video sysimgblt i2c_piix4 backlight ata_generic mac_hid i2c_core ohci_pci pata_acpi acpi_cpufreq

```

----------

## Gentlenoob

Guess I won't be of much help, just got triggered by having an old laptop (Samsung Q35) with similar (same?) processor, fairly up to date, in particular wpa_supplicant-2.10-r1 with tkip set now, although not in use. Works just fine, WiFi hardware seems different, though (Intel 3945ABG).

This I've never seen

```

divide error: 0000 [#1] SMP

```

Wild-ass guessing on basic things:

 modules and kernel (here: 5.10.103) coherent? (forgotten make modules_install?)

 choice of CPU for kernel? here: Pentium M, CONFIG_MPENTIUMM=y

 CPU_flags in make.conf? here according to cpuid2cpuflags 

  CPU_FLAGS_X86="mmx mmxext sse sse2 sse3"

Good luck,

  Ralph

----------

## Hu

That kernel stack trace might be more useful with verbose debug information, so that we could see file and line number details.

Or you could ignore it and just disable TKIP.  No one should be using TKIP if avoidable.  I see in your earlier post that disabling TKIP caused other problems, but fixing those might be easier than debugging this kernel crash.

----------

## drvolk68

I just want to let you know that I am having the same issue (no kernel crash, but no AP available when scanning) on my desktop and my notebook. Both have Ryzen CPUs; maybe this has something to do with it? I also use ~amd64 as keyword in make.conf and a hardened Gentoo kernel. I had to install the libressl overlay to get back to version 2.9, which works without any problem.

UPDATE:

Just noticed that there is another thread for exactly the issue I have. There the solution was to set the tkip USE flag (it seems that old routers or so need that; I did not really understand).

----------

## Hu

 *drvolk68 wrote:*   

> I just want to let you know that I am having the same issue (no kernel crash, but no AP available when scanning)

 OP has a kernel crash.  You do not.  Therefore, you do not have the same issue.  I am glad you found the TKIP thread, though as discussed there, the proper fix is not to use TKIP on the router.  Posting that you have "the same" issue when your problem is actually different will usually lead to people giving you advice that is relevant to the original issue, not to your variant of it.

----------

## sublogic

 *Hu wrote:*   

> That kernel stack trace might be more useful with verbose debug information, so that we could see file and line number details.

 

Yes.  I thought this would do it:

```
$ </proc/config.gz zgrep -i debug.info

CONFIG_DEBUG_INFO=y

# CONFIG_DEBUG_INFO_REDUCED is not set

# CONFIG_DEBUG_INFO_COMPRESSED is not set

# CONFIG_DEBUG_INFO_SPLIT is not set

CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y

# CONFIG_DEBUG_INFO_DWARF4 is not set

# CONFIG_DEBUG_INFO_DWARF5 is not set

# CONFIG_DEBUG_INFO_BTF is not set

```

The kernel Makefile adds -g to DEBUG_CFLAGS when CONFIG_DEBUG_INFO=y (or it looks like it does) so that *should* have done it ?

 *Quote:*   

> Or you could ignore it and just disable TKIP.  No one should be using TKIP if avoidable.  I see in your earlier post that disabling TKIP caused other problems, but fixing those might be easier than debugging this kernel crash.

 

The router comes from the ISP and the web admin page is disabled.  Bah.

Rebuilding the kernel now.  Also streamlining the .config and initramfs to make this process easier.  Stay tuned.  And thanks.

----------

## sublogic

Sigh.  I'm doing everything right.  Kernel has debugging info, got a post-panic dump (vmcore) with kexec, pulled the dmesg with the vmcore-dmesg utility, but the panic traceback at the end has no line numbers.  To do more I need the crash utility.  So I emerged it, but:

```
$ file vmcore

vmcore: ELF 32-bit LSB core file, Intel 80386, version 1 (SYSV), SVR4-style

$ file vmlinux

vmlinux: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, BuildID[sha1]=00e96c016ec7cadf7b116778f0174524fa0da42d, with debug_info, not stripped

$ crash vmlinux vmcore

crash 8.0.0

... <snip> ...

For help, type "help".

Type "apropos word" to search for commands related to "word"...

crash: cannot resolve "sys_open"

$
```

These "cannot resolve" errors happen all the time.  For example you can find fixes for them scattered in this crash changelog.  The crash developers must have a hard time keeping up with internal kernel APIs.

I filed a bug report on their github, that's all I can do for now.

----------

## Hu

divide error may mean that the kernel attempted a division by zero.  A cursory inspection of the affected code shows there are several sites which use % where the denominator comes from a variable.  One such, which could even match your output, is rtl8180_tx, which will take:

```
idx = (ring->idx + skb_queue_len(&ring->queue)) % ring->entries;
```

If you want to continue this, you could disassemble the faulting function and try to resolve the line number manually.  You could try patching the division site(s) to test for a zero first, and log some error before aborting the operation.

This file appears not to be touched often.  It has not changed between v5.15 and Linus' current tree.
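The zero test suggested above can be modelled outside the kernel. What follows is only a sketch in Python, not a driver patch: `next_tx_index` is a hypothetical helper that mirrors the arithmetic of the quoted line, and returning `None` stands in for "log an error and drop the frame".

```python
from typing import Optional


def next_tx_index(ring_idx: int, queue_len: int, entries: int) -> Optional[int]:
    """Model of: idx = (ring->idx + skb_queue_len(&ring->queue)) % ring->entries.

    Returns None instead of dividing when the ring was never set up
    (entries == 0), which is the case a patched driver would have to log.
    """
    if entries == 0:
        return None
    return (ring_idx + queue_len) % entries


# A healthy ring wraps around; a zeroed ring is reported instead of crashing.
print(next_tx_index(30, 5, 32))
print(next_tx_index(0, 0, 0))
```

In the real driver the equivalent check would sit just before the modulo in rtl8180_tx(); where exactly, and what to do with the dropped skb, is left open here.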

----------

## sublogic

 *Hu wrote:*   

> If you want to continue this, you could disassemble the faulting function and try to resolve the line number manually.  You could try patching the division site(s) to test for a zero first, and log some error before aborting the operation.

 

Thanks, good idea.  And/or, I could hack dev-util/crash into submission.  Woah, that was easy.  Tiny patch submitted upstream.  Now I have to learn crash.  Should I post the patch in the gentoo bugzilla so we don't have to wait for upstream ?

```
$ crash /var/crash/vm{linux,core}

crash 8.0.0

Copyright (C) 2002-2021  Red Hat, Inc.

... <snip> ...

For help, type "help".

Type "apropos word" to search for commands related to "word"...

WARNING: cannot determine hardirq_ctx addresses 

WARNING: cannot determine softirq_ctx addresses

      KERNEL: /var/crash/vmlinux        

    DUMPFILE: /var/crash/vmcore

        CPUS: 2

        DATE: Sat Mar 26 20:48:55 EDT 2022

      UPTIME: 00:02:04

LOAD AVERAGE: 0.98, 0.52, 0.20

       TASKS: 107

    NODENAME: gateway

     RELEASE: 5.15.26-gentoo-x86-1

     VERSION: #1 SMP Sat Mar 26 19:41:52 EDT 2022

     MACHINE: i686  (1729 Mhz)

      MEMORY: 887.6 MB

       PANIC: "divide error: 0000 [#1] SMP"

         PID: 4026

     COMMAND: "wpa_supplicant"

        TASK: c27b8000  [THREAD_INFO: c27b8000]

         CPU: 1

       STATE: TASK_RUNNING (PANIC)

crash> 

```

 *Hu wrote:*   

> This file appears not to be touched often.  It has not changed between v5.15 and Linus' current tree.

 

I wouldn't expect a lot of activity.  It's a driver for a 2004 wireless chip.

----------

## Hu

 *sublogic wrote:*   

> I could hack dev-util/crash into submission.  Woah, that was easy.  Tiny patch submitted upstream.  Now I have to learn crash.  Should I post the patch in the gentoo bugzilla so we don't have to wait for upstream ?

 I think that would be fine.  If it is small and fixes a clear bug in the existing in-tree version, I expect that having it backported would be useful.

 *sublogic wrote:*   

> I wouldn't expect a lot of activity.  It's a driver for a 2004 wireless chip.

 That's fair.  I was thinking that this was a kernel regression, and that if the file had been touched, then we could blame a recent change in that file for your problem.  Since your original report was that the version of wpa_supplicant is the controlling factor, it may be that this driver has been wrong for a very long time, and only a recent change to wpa_supplicant exposed the driver bug.  In that case, it is probably simpler to fix the driver not to panic, then work toward having it behave reasonably.

----------

## Rutcha

same here for ' Intel(R) Centrino(R) Advanced-N 6230 AGN, REV=0xB0 '

----------

## sublogic

 *Rutcha wrote:*   

> same here for ' Intel(R) Centrino(R) Advanced-N 6230 AGN, REV=0xB0 '

 

Rutcha, same as in "no wireless", or same as in "panic" ?  If it's a panic, post the relevant part of "lspci -kv".  I want to compare wireless adapters and kernel drivers.  Thanks.  Here's mine:

```
$ lspci -vk |tail -n 10

08:09.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8185 IEEE 802.11a/b/g Wireless LAN Controller (rev 20)

   Subsystem: Realtek Semiconductor Co., Ltd. RTL-8185 IEEE 802.11a/b/g Wireless LAN Controller

   Flags: bus master, medium devsel, latency 64, IRQ 22

   I/O ports at b000 [size=256]

   Memory at d0200000 (32-bit, non-prefetchable) [size=512]

   Capabilities: <access denied>

   Kernel driver in use: rtl818x_pci

   Kernel modules: rtl818x_pci
```

Fixing the driver won't be easy.  It's listed as "Orphan" in linux/MAINTAINERS .  Any kernel hackers around ?

----------

## sublogic

Okay, here's what I found so far using the crash and objdump utilities.  The start of the panic in dmesg:

```
[  124.985777] divide error: 0000 [#1] SMP

[  124.985862] CPU: 1 PID: 4026 Comm: wpa_supplicant Kdump: loaded Not tainted 5.15.26-gentoo-x86-1 #1

[  124.985948] Hardware name: Gateway MX         /, BIOS 83.08 03/06/07

[  124.986016] EIP: rtl8180_tx+0x1c1/0x530 [rtl818x_pci]

```

EIP is the program counter at the crash point.  rtl8180_tx is a function in the rtl818x_pci module.  Here's the beginning of it, starting at line 454.

```
crash> mod -s rtl818x_pci

 MODULE   NAME                     BASE      SIZE  OBJECT FILE

f8337100  rtl818x_pci            f832e000   45056  /lib/modules/5.15.26-gentoo-x86-1/kernel/drivers/net/wireless/realtek/rtl818x/rtl8180/rtl818x_pci.ko 

crash> 

crash> gdb list *rtl8180_tx

0xf832f8c0 is in rtl8180_tx (drivers/net/wireless/realtek/rtl818x/rtl8180/dev.c:457).

452     }

453     

454     static void rtl8180_tx(struct ieee80211_hw *dev,

455                            struct ieee80211_tx_control *control,

456                            struct sk_buff *skb)

457     {

458             struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);

459             struct ieee80211_hdr *hdr = (struct ieee80211_hdr *)skb->data;

460             struct rtl8180_priv *priv = dev->priv;

461             struct rtl8180_tx_ring *ring;

crash> 

```

and here's the listing centered on the crashing line, line 544.

```
crash> gdb list *rtl8180_tx+0x1c1

0xf832fa81 is in rtl8180_tx (drivers/net/wireless/realtek/rtl818x/rtl8180/dev.c:544).

539                             priv->seqno += 0x10;

540                     hdr->seq_ctrl &= cpu_to_le16(IEEE80211_SCTL_FRAG);

541                     hdr->seq_ctrl |= cpu_to_le16(priv->seqno);

542             }

543     

544             idx = (ring->idx + skb_queue_len(&ring->queue)) % ring->entries;

545             entry = &ring->desc[idx];

546     

547             if (priv->chip_family == RTL818X_CHIP_FAMILY_RTL8187SE) {

548                     entry->frame_duration = cpu_to_le16(frame_duration);

crash> 

```
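The `symbol+offset/size` notation in the EIP line turns into an absolute address by plain addition. A small sketch in Python, assuming the function start address printed by `gdb list *rtl8180_tx` above; `parse_eip` is a made-up helper, not part of crash or gdb.

```python
import re


def parse_eip(line: str):
    """Split 'EIP: symbol+0xOFF/0xSIZE [module]' into its parts."""
    m = re.search(r"EIP: (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?: \[(\w+)\])?", line)
    if m is None:
        raise ValueError("not an EIP line")
    symbol, offset, size, module = m.groups()
    return symbol, int(offset, 16), int(size, 16), module


eip_line = "[  124.986016] EIP: rtl8180_tx+0x1c1/0x530 [rtl818x_pci]"
symbol, offset, size, module = parse_eip(eip_line)

# Start of rtl8180_tx as reported by "gdb list *rtl8180_tx" in this session.
FUNC_START = 0xF832F8C0
fault_addr = FUNC_START + offset

print(f"{symbol}+{offset:#x} in [{module}] -> {fault_addr:#x}")
```

The result matches the address gdb prints for `*rtl8180_tx+0x1c1`. The module load address changes from boot to boot, so `FUNC_START` is only valid for this particular dump.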

The variable ring was initialized a bit earlier.

```
460             struct rtl8180_priv *priv = dev->priv; 

...

473             prio = skb_get_queue_mapping(skb);

474             ring = &priv->tx_ring[prio];

```

Anyway, Hu was exactly right in post 8696042.

Here is the disassembly of line 544 (from objdump -S).

```
        idx = (ring->idx + skb_queue_len(&ring->queue)) % ring->entries;

    1a6a:       8b 75 f0                mov    -0x10(%ebp),%esi

    1a6d:       31 d2                   xor    %edx,%edx

    1a6f:       c1 e6 05                shl    $0x5,%esi

    1a72:       8d 0c 37                lea    (%edi,%esi,1),%ecx

    1a75:       8b 81 bc 00 00 00       mov    0xbc(%ecx),%eax

    1a7b:       03 81 ac 00 00 00       add    0xac(%ecx),%eax

    1a81:       f7 b1 b0 00 00 00       divl   0xb0(%ecx)

```

The assembly references memory at register ecx + three offsets, 0xb0, 0xac, 0xbc.  Those must be the three accesses to *ring fields in the source line.  The 0xb0(%ecx) must be ring->entries, the divisor.  From the first source listing, ring is a (struct rtl8180_tx_ring *) and ring->entries is at offset 12 (0xc) into that structure.

```
crash> whatis struct rtl8180_tx_ring

struct rtl8180_tx_ring {

    struct rtl8180_tx_desc *desc;

    dma_addr_t dma;

    unsigned int idx;

    unsigned int entries;

    struct sk_buff_head queue;

}

SIZE: 32

crash> 

```
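The offset of `entries` can be double-checked by mirroring the structure with Python's `ctypes`. This is a sketch under assumptions: pointers and `dma_addr_t` are taken to be 4 bytes (32-bit kernel) and the lock inside `sk_buff_head` is taken to be 4 bytes, which is consistent with the reported SIZE of 32 but is not read from the kernel headers. The class names are invented for the example.

```python
import ctypes


class SkBuffHead32(ctypes.Structure):
    # Assumed 32-bit layout: two pointers, a length, and a 4-byte lock.
    _fields_ = [
        ("next", ctypes.c_uint32),
        ("prev", ctypes.c_uint32),
        ("qlen", ctypes.c_uint32),
        ("lock", ctypes.c_uint32),
    ]


class TxRing32(ctypes.Structure):
    # Mirrors struct rtl8180_tx_ring as printed by "whatis" above.
    _fields_ = [
        ("desc", ctypes.c_uint32),     # struct rtl8180_tx_desc *
        ("dma", ctypes.c_uint32),      # dma_addr_t, assumed 32-bit
        ("idx", ctypes.c_uint32),
        ("entries", ctypes.c_uint32),
        ("queue", SkBuffHead32),
    ]


print("sizeof :", ctypes.sizeof(TxRing32))
print("idx    :", hex(TxRing32.idx.offset))
print("entries:", hex(TxRing32.entries.offset))
print("qlen   :", hex(TxRing32.queue.offset + SkBuffHead32.qlen.offset))
```

With this layout the three fields touched by the source line sit at 0x8, 0xc and 0x18 from the start of the ring, the same spacing as the 0xac, 0xb0 and 0xbc displacements in the disassembly.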

(The first three fields are all size 4, as will be apparent later when I dump the struct.)  So *ring itself must be at ecx + (0xb0-0xc = 0xa4).  And we know ecx from the start of the panic dump.  Showing a little more:

```
[  124.985777] divide error: 0000 [#1] SMP

[  124.985862] CPU: 1 PID: 4026 Comm: wpa_supplicant Kdump: loaded Not tainted 5.15.26-gentoo-x86-1 #1

[  124.985948] Hardware name: Gateway MX         /, BIOS 83.08 03/06/07

[  124.986016] EIP: rtl8180_tx+0x1c1/0x530 [rtl818x_pci]

[  124.986096] Code: 16 83 e0 0f 66 89 46 16 66 0b 87 9e 05 00 00 66 89 46 16 8b 75 f0 31 d2 c1 e6 05 8d 0c 37 8b 81 bc 00 00 00 03 81 ac 00 00 00 <f7> b1 b0 00 00 00 c1 e2 05 03 91 a4 00 00 00 83 bf 84 05 00 00 02

[  124.986204] EAX: 00000000 EBX: c84c9b40 ECX: c89318a0 EDX: 00000000

[  124.986276] ESI: 00000040 EDI: c8931860 EBP: c9eefa8c ESP: c9eefa5c

[  124.986346] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210046

[  124.986419] CR0: 80050033 CR2: b7d125f0 CR3: 09133000 CR4: 000006d0

```

Fourth line from the bottom, ecx was 0xc89318a0 at the time of the crash.  ecx+0xa4 is 0xc8931944, the address of *ring.  So look there:

```
crash> struct rtl8180_tx_ring c8931944

struct rtl8180_tx_ring {

  desc = 0x0,

  dma = 0,

  idx = 0,

  entries = 0,

  queue = {

    next = 0x0,

    prev = 0x0,

    qlen = 0,

    lock = {

      {

        rlock = {

          raw_lock = {

            {

              val = {

                counter = 0

              },

              {

                locked = 0 '\000',

                pending = 0 '\000'

              },

              {

                locked_pending = 0,

                tail = 0

              }

            }

          }

        }

      }

    }

  }

}

crash> 

```

That looks highly suspicious.  Somebody is passing uninitialized data to rtl8180_tx().  I haven't been able to reach the argument "dev" from which ring is ultimately derived.  But clearly the bug is higher up in the call chain.

Anyway, I'll keep looking.  I'm learning as I go.
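The register arithmetic in this post can be replayed with a few lines of Python. This is a rough sketch: it decodes only the one `f7 /6` form seen here, not x86 in general, and the constants are copied from the panic dump and the `whatis` output above. The `prio` value is an inference from the `shl $0x5,%esi` in the disassembly (32 being the size of one ring), not something the dump states directly.

```python
CODE = ("16 83 e0 0f 66 89 46 16 66 0b 87 9e 05 00 00 66 89 46 16 8b 75 f0 "
        "31 d2 c1 e6 05 8d 0c 37 8b 81 bc 00 00 00 03 81 ac 00 00 00 <f7> b1 "
        "b0 00 00 00 c1 e2 05 03 91 a4 00 00 00 83 bf 84 05 00 00 02")


def faulting_bytes(code: str) -> bytes:
    """Return the bytes from the <..> marker (the faulting opcode) onward."""
    tokens = code.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return bytes(int(t.strip("<>"), 16) for t in tokens[start:])


insn = faulting_bytes(CODE)
opcode, modrm = insn[0], insn[1]
mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
disp32 = int.from_bytes(insn[2:6], "little")
# opcode 0xf7 with reg == 6 is DIV r/m32; mod == 2 means a 32-bit
# displacement follows; rm == 1 selects ECX as the base register.

ECX, ESI, EDI = 0xC89318A0, 0x00000040, 0xC8931860   # from the register dump
ENTRIES_OFFSET = 0xC   # offset of 'entries' in struct rtl8180_tx_ring
RING_SIZE = 32         # SIZE reported by "whatis"

ring_addr = ECX + disp32 - ENTRIES_OFFSET
prio = ESI // RING_SIZE   # ESI held prio << 5 at the time of the fault

print(f"div operand at ECX+{disp32:#x}, ring at {ring_addr:#x}, prio {prio}")
```

Since `lea (%edi,%esi,1),%ecx` makes ECX the sum of EDI and ESI, the same ring address can also be written as EDI + 0xa4 + prio * 32, which gives a cross-check on the register values.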

----------

## Hu

 *sublogic wrote:*   

> Fixing the driver won't be easy.  It's listed as "Orphan" in linux/MAINTAINERS .

 If you can identify a fix, you might be able to feed it through the general networking tree.

 *sublogic wrote:*   

> Anyway, I'll keep looking.  I'm learning as I go.

 You're doing a good job so far.  I was impressed to see the assembly analysis.  Many people would just give up when faced with the need to trace the code flow through the assembly.

 *sublogic wrote:*   

> Somebody is passing uninitialized data to rtl8180_tx().  I haven't been able to reach the argument "dev" from which ring is ultimately derived.  But clearly the bug is higher up in the call chain.

 The panic gives an approximated call stack, though that is not guaranteed to be correct in all cases.  It looks plausible here though.  As for reaching dev, one option would be to build a custom kernel that is specially patched to make this data easy to find in the crash dump.  For example (untested), add just above the division line:

```
   struct divzero_debugging_data {

      struct ieee80211_hw *dev;

   };

   static struct divzero_debugging_data dzdd;

   if (!ring->entries) {

      WRITE_ONCE(dzdd.dev, dev);

   }
```

Add more structure members / WRITE_ONCE calls as you see fit.  This stores the interesting values into a global scope structure, where you should be able to dump them using the debugging tools after the crash.

Once you commit to using a patched kernel for hunting this, you could also use that to make it easier to trace parts of the call chain.  Instrument various callers to save their __FILE__ / __LINE__, or other interesting local variables, into global variables just before calling the sequence of functions that ultimately crashes.

----------

## Rutcha

I'm sorry! My problem was not wpa_supplicant related.

I was brief at first because I didn't want to disturb this thread much. I actually panicked myself after updating wpa_supplicant and having no wifi/internet and not being able to google myself out of it.

I'm sorry!

I did not have any kernel panic; I was only unable to connect to wifi. wpa_supplicant only echoed attempts to connect and failures ( wpa_supplicant -i wlp1s0 -c /etc/wpa_supplicant/wpa_supplicant.conf )

I tried a USB dongle after I posted here and still couldn't get a connection, so it was unlikely to be a hardware issue.

It turns out everything went back to normal after I got rid of a custom wireless-regdb database I had previously tampered with to allow use of a wider frequency spectrum. I suppose I had to update that table too after updating wpa_supplicant, but anyway, for simplicity I changed back the USE flags for wpa_supplicant and my kernel .config. I don't know for sure what made it work, but I decided to be more conservative in all of these and something did make it go back to normal. First my wifi but not 5 GHz, then everything.

Sorry again and thanks for being prompt.

Have a nice day!

----------

## sublogic

Quick update to post 8696272: I found the "dev" argument to rtl8180_tx().  The kernel is compiled with -mregparm=3 so the arguments are passed in registers, not in the stack where I was looking for them.  Anyway, the dev struct looks plausible.  I won't post it as it is pretty big.  The problem is this:

```
prio = skb_get_queue_mapping(skb);

ring = &priv->tx_ring[prio];

```

prio comes out equal to 2.  Meanwhile, priv->tx_ring is an array of five struct rtl8180_tx_ring;  the first two, [0,1] look sane.  The last three, [2,3,4] are filled with zeros.  It was priv->tx_ring[2] that caused the division by zero.
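As a toy model of that state (Python, not driver code): five ring slots, of which only the first two have a nonzero `entries`. The ring size of 16 is an arbitrary placeholder, and `TxRing`, `build_rings` and `uninitialized_rings` are names invented for the example.

```python
from dataclasses import dataclass
from typing import List

NR_RINGS = 5          # priv->tx_ring has five slots
NR_INITIALIZED = 2    # only [0] and [1] looked sane in the dump


@dataclass
class TxRing:
    idx: int = 0
    entries: int = 0   # 0 marks a slot that was never set up
    qlen: int = 0


def build_rings(entries: int = 16) -> List[TxRing]:
    """First NR_INITIALIZED rings are usable, the rest stay zero-filled."""
    return [TxRing(entries=entries if i < NR_INITIALIZED else 0)
            for i in range(NR_RINGS)]


def uninitialized_rings(rings: List[TxRing]) -> List[int]:
    """Indices whose use in '% ring.entries' would divide by zero."""
    return [i for i, ring in enumerate(rings) if ring.entries == 0]


rings = build_rings()
prio = 2   # value of skb_get_queue_mapping(skb) in the crash
print(uninitialized_rings(rings), prio in uninitialized_rings(rings))
```

Any queue mapping of 2 or above lands on a zeroed slot in this model, which is the situation the dump shows for priv->tx_ring[2].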

Next task was to find the caller.  It is ieee80211_tx_frags() called by __ieee80211_tx(), called by ieee80211_tx_pending_skb(), called by ieee80211_tx_pending(), all in the mac80211 module, the last one a tasklet callback.  So it is interrupt-driven.

Asynchronous bugs are fun !  Now to find who enqueued a pending packet with a bogus struct sk_buff .

I also would like to find out what triggered this bug in wpa_supplicant, but I'm not sure I can debug a userspace task from a vmcore.

To be continued...

----------

## sublogic

Small progress.  I obtained a new crash dump from a 5.15.32-gentoo-r1 kernel.  In the crashing function,

```
454     static void rtl8180_tx(struct ieee80211_hw *dev,

455                            struct ieee80211_tx_control *control,

456                            struct sk_buff *skb)

457     {

...
```

 it is the third parameter, skb, that is bogus.

The parameters dev, control and skb are passed in registers eax, edx and ecx.  eax is saved on the stack before it gets modified, so I can get to the "dev" pointer post-mortem from the crash dump.  The *dev structure looks sane (as far as I can tell).

edx is modified and not saved, but the "control" argument is unused, so no matter.

ecx is saved in ebx before it is modified, and ebx is referenced but unchanged until after the crash point.  So I have the address of *skb from the panic's register dump: ebx=0xc6a916c0 .

struct sk_buff *skb is a "socket buffer", defined in include/linux/skbuff.h .  It's a big mess of pointers, bitfields and unions.  Most of the fields are zero, starting with these two:

```
crash> struct sk_buff 0xc6a916c0 |head -n 5

struct sk_buff {

  {

    {

      next = 0x0,

      prev = 0x0,
```

From skbuff.h ,

```
 *      @next: Next buffer in list

 *      @prev: Previous buffer in list
```

and NULLs don't make sense in a live socket buffer.

Some fields are valid, I assume from a previous use of this memory:

```
crash> struct sk_buff.dev 0xc6a916c0

        dev = 0xc777c000,

crash> rd 0xc777c000 3

c777c000:  38706c77 00003973 00000000            wlp8s9......
```

Indeed, wlp8s9 is the name of my interface.  But a lot of fields are zero, including some pointers and values that shouldn't be (skb->sk, skb->tstamp).
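The ASCII column that `rd` prints can be reproduced by hand: the three 32-bit words are stored little-endian, so the bytes come out in reverse order within each word. A minimal sketch in Python; `rd_words_to_bytes` is a hypothetical helper, and the words are the ones from the `rd` output above.

```python
import struct


def rd_words_to_bytes(words):
    """Pack 32-bit words as they sit in memory on a little-endian x86."""
    return struct.pack(f"<{len(words)}I", *words)


raw = rd_words_to_bytes([0x38706C77, 0x00003973, 0x00000000])
# The interface name is a NUL-terminated C string at the start of the dump.
name = raw.split(b"\x00", 1)[0].decode("ascii")

print(raw.hex(" "))
print(name)
```

The same trick works for any `rd` dump of a C string, as long as the word size and byte order match the machine that produced the vmcore.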

The prio=2 that I identified as the cause of the crash in post 8696481 was derived from this corrupted *skb.

```
473             prio = skb_get_queue_mapping(skb);

...

crash> struct sk_buff.queue_mapping 0xc6a916c0

  queue_mapping = 2,

crash> 
```

Anyway.  I've traced the call chain to [mac80211]ieee80211_tx_pending(struct tasklet_struct *t) .  Tasklets are part of the interrupt system.  I need to find and instrument the code that queued a bogus tasklet_struct for deferred processing.  Not sure where to look.

----------

## deagol

Quite informative how to debug kernel crashes!

But to me it looks like this question is already (mostly) answered:

 *sublogic wrote:*   

> I need to find and instrument the code that queued a bogus tasklet_struct for deferred processing.  Not sure where to look.

 The kernel crash already showed which functions were called up to the crash:

```
[  141.216050]  ? rtl8180_interrupt+0x90/0x90 [rtl818x_pci]

[  141.216127]  ieee80211_tx_frags+0x131/0x1f0 [mac80211]

[  141.216434]  __ieee80211_tx+0x64/0x130 [mac80211]

[  141.216690]  ieee80211_tx+0xb2/0x100 [mac80211]

[  141.216949]  ieee80211_xmit+0xa9/0xe0 [mac80211]

[  141.217206]  __ieee80211_subif_start_xmit+0x946/0xbb0 [mac80211]

[  141.217471]  ieee80211_tx_control_port+0x16d/0x1c0 [mac80211]

[  141.217731]  ? __ieee80211_tx_skb_tid_band+0x80/0x80 [mac80211]

[  141.217833]  nl80211_tx_control_port+0x15c/0x2c0 [cfg80211]

[  141.217833]  genl_rcv_msg+0x135/0x320

[  141.217833]  ? __slab_free+0x99/0x280

[  141.217833]  ? nl80211_msg_put_channel.part.0+0x5c0/0x5c0 [cfg80211]
```

It's been some time since I looked at the code, but I'm pretty sure this must be an EAPOL packet sent directly from wpa_supplicant, which should be either EAPOL#2 or EAPOL#4.

First, the crashing process is wpa_supplicant, so this was almost certainly not a queued packet. Second, we see that the packet was injected via nl80211_tx_control_port, and one feature of control port is that packets can't be queued. (The driver gets the packet directly. The driver can then of course assign a queue for priorities.)

Lastly, control_port is only used for EAPOL frames, and when you connect to an AP the only EAPOL frames sent from the client are EAPOL#2 and EAPOL#4.

Checking wpa_supplicant, the sole caller of nl80211_tx_control_port is wpa_ether_send in wpa_supplicant/wpas_glue.c (ibss_rsn.c is for handling Ad-Hoc networks, which we can ignore as long as you connect to an AP).

So where the skb came from, and the exact paths through wpa_supplicant and the kernel, are more or less known.

It would be interesting to see if the problem goes away when you avoid using control port. One way would be to prevent this if condition in wpa_supplicant/wpas_glue.c from being true:

```
        if (wpa_s->drv_flags & WPA_DRIVER_FLAGS_CONTROL_PORT) {

                int encrypt = wpa_s->wpa &&

                        wpa_sm_has_ptk_installed(wpa_s->wpa);

                return wpa_drv_tx_control_port(wpa_s, dest, proto, buf, len,

                                               !encrypt);

        }
```

That should at least remove the reference to nl80211_tx_control_port from the crash, and maybe even sidestep the issue if the problem is something on the control port path.

edit:

Just checked and wpa_supplicant 2.9 did not support control Port. So wpa_supplicant 2.10 is the first (stable) version using control port for EAPOL.

One wild guess from looking at the driver code:

There are two card variants in the driver you use: RTL8180 with two queues and RTL818X with 5. According to lspci you have a RTL-8185, so you should have 5 correctly initialized queues. But you only found two good looking queue structs... 

So the driver may report 5 queues but only initialize two due to some bug. Now with wpa_supplicant 2.10, control port marks the packet as high priority, scheduling it for one of the uninitialized queues, and ... division by zero.

Pretty sure disabling control port or setting RTL818X_NR_TX_QUEUES = 2 in drivers/net/wireless/realtek/rtl818x/rtl8180/rtl8180.h must then avoid the division by zero.

----------

## sublogic

 *deagol wrote:*   

> Quite informative how to debug kernel crashes!

 No, no.  I'm faking it.  And I appreciate your help with the userspace side and the WPA protocol.

Here are a few references for the crash utility.  Out of date, all of them, but a lot better than nothing.

https://crash-utility.github.io/crash_whitepaper.html

https://www.dedoimedo.com/computers/crash-book.html

https://lucasvr.gobolinux.org/etc/Debugging%20the%20kernel%20with%20Crash%20tool.pdf

And to get the kernel core dump,

https://wiki.gentoo.org/wiki/Kernel_Crash_Dumps

topic 1147842

 *deagol wrote:*   

>  There are two card variants in the driver you use: RTL8180 with two queues and RTL818X with 5. According to lspci you have a RTL-8185, so you should have 5 correctly initialized queues. But you only found two good looking queue structs...

 They do that on purpose, according to comments in rtl8180.h:

```
/* rtl8180/rtl8185 have 3 queue + beacon queue.

 * mac80211 can use just one, + beacon = 2 tot.

 */

#define RTL8180_NR_TX_QUEUES 2

/* rtl8187SE have 6 queues + beacon queues

 * mac80211 can use 4 QoS data queue, + beacon = 5 tot

 */

#define RTL8187SE_NR_TX_QUEUES 5

/* for array static allocation, it is the max of above */

#define RTL818X_NR_TX_QUEUES 5
```

More comments in dev.c:167, that I don't understand:

```
/* Queues for rtl8180/rtl8185 cards
 *
 * name | reg  |  prio
 *  BC  |  7   |   3
 *  HI  |  6   |   0
 *  NO  |  5   |   1
 *  LO  |  4   |   2
 *
 * The complete map for DMA kick reg using all queue is:
 * static const int rtl8180_queues_map[RTL8180_NR_TX_QUEUES] = {6, 5, 4, 7};
 *
 * .. but .. Because the mac80211 needs at least 4 queues for QoS or
 * otherwise QoS can't be done, we use just one.
 * Beacon queue could be used, but this is not finished yet.
 * Actual map is:
 *
 * name | reg  |  prio
 *  BC  |  7   |   1  <- currently not used yet.
 *  HI  |  6   |   x  <- not used
 *  NO  |  5   |   x  <- not used
 *  LO  |  4   |   0  <- used
 */

static const int rtl8180_queues_map[RTL8180_NR_TX_QUEUES] = {4, 7};
```

I don't know, I saw code that sets prio=1, so much for "currently not used".

Anyway, the initialization is at dev.c:rtl8180_init_hw():856.  My priv->chip_family is RTL818X_CHIP_FAMILY_RTL8185.

```
        /* mac80211 queue have higher prio for lower index. The last queue
         * (that mac80211 is not aware of) is reserved for beacons (and have
         * the highest priority on the NIC)
         */
        if (priv->chip_family != RTL818X_CHIP_FAMILY_RTL8187SE) {
                rtl818x_iowrite32(priv, &priv->map->TBDA,
                                  priv->tx_ring[1].dma);
                rtl818x_iowrite32(priv, &priv->map->TLPDA,
                                  priv->tx_ring[0].dma);
        } else {
                rtl818x_iowrite32(priv, &priv->map->TBDA,
                                  priv->tx_ring[4].dma);
                rtl818x_iowrite32(priv, &priv->map->TVODA,
                                  priv->tx_ring[0].dma);
                rtl818x_iowrite32(priv, &priv->map->TVIDA,
                                  priv->tx_ring[1].dma);
                rtl818x_iowrite32(priv, &priv->map->TBEDA,
                                  priv->tx_ring[2].dma);
                rtl818x_iowrite32(priv, &priv->map->TBKDA,
                                  priv->tx_ring[3].dma);
        }
```

 *deagol wrote:*   

> So the driver may report 5 queues but only initialize two due to some bug.

 

It seems to report only one queue:

```
crash> struct ieee80211_hw.queues c6ae8500
  queues = 1,
crash>
```

Also,

```
$ ls /sys/class/net/wlp8s9/queues
rx-0  tx-0
```

Is the driver correct about the 8185 hardware ?  Is it out of date with its support of cfg80211 ?

 *deagol wrote:*   

> It would be interesting if the problem goes away when you avoid usage of control port. One way would be, to prevent this if condition in wpa_supplicant/wpas_glue.c to be true:
> 
> ```
>         if (wpa_s->drv_flags & WPA_DRIVER_FLAGS_CONTROL_PORT) {
> ...
> ```

 Well, I can ebuild-configure wpa_supplicant and edit WPA_DRIVER_FLAGS_CONTROL_PORT to #undef before finishing the merge.  Kind of clumsy.  I don't know how to write ebuilds yet.  I suppose I could script it.  It would be a useful band-aid if I lose my binpkg of wpa_supplicant-2.9 .

I'd rather hack the driver though.  It should fail but not crash.  If I can assume that the broken struct sk_buff is not utter garbage, but is just reliably beyond rtl8180's capability, I can just check for prio>1 ?  Near the top of the crashing function rtl8180_tx() is an error exit if the DMA mapping fails.  I could clone that with a check on prio ?  What will userspace do ?  Then again, what do I have to lose, a kernel panic ?  Hah !
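For illustration only, here is a minimal user-space sketch of that guard. `tx_prio_usable()` and `toy_tx()` are hypothetical names, not driver functions; the only value taken from the driver is RTL8180_NR_TX_QUEUES = 2 from rtl8180.h. The real check would sit near the top of rtl8180_tx() and would have to dispose of the skb before returning, presumably the way the DMA-mapping error exit does.

```c
#include <stdbool.h>

#define RTL8180_NR_TX_QUEUES 2   /* value quoted from rtl8180.h */

/* Hypothetical helper: true when prio indexes a ring that the
 * rtl8180/rtl8185 path actually initialized (tx_ring[0] or tx_ring[1]). */
static bool tx_prio_usable(unsigned int prio)
{
	return prio < RTL8180_NR_TX_QUEUES;
}

/* Hypothetical model of the early exit: 0 = frame accepted, -1 = dropped. */
static int toy_tx(unsigned int prio)
{
	if (!tx_prio_usable(prio))
		return -1;   /* the driver would drop the frame here instead of touching the ring */
	return 0;
}
```

With this shape a frame mapped to queue 2 is dropped rather than dereferencing an uninitialized ring; what userspace does with the lost frame is a separate question.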

----------

## deagol

We can also disable the control port in mac80211, if you prefer kernel patches. I have not tested it, but removing these lines in net/mac80211/main.c should do the trick:

```
    wiphy_ext_feature_set(wiphy,
                  NL80211_EXT_FEATURE_CONTROL_PORT_OVER_NL80211);
    wiphy_ext_feature_set(wiphy,
                  NL80211_EXT_FEATURE_CONTROL_PORT_NO_PREAUTH);
    wiphy_ext_feature_set(wiphy,
                  NL80211_EXT_FEATURE_CONTROL_PORT_OVER_NL80211_TX_STATUS);
```

It's probably enough to remove the first statement, but since the other two features depend on the first...

But patching Gentoo ebuilds is really simple. For me it's even one of the main reasons I use Gentoo.

When you want to try modifications to any Gentoo ebuild, user patches are often the simplest way: https://wiki.gentoo.org/wiki//etc/portage/patches

But for a quick shot I normally just build the package with ebuild. This can be tuned a bit when you want to try different modifications, but let's start with the basic approach. (The paths may be different for you, of course.) Become root:

```
sudo -s
```

Get wpa_supplicant ready for compilation:

```
ebuild /var/db/repos/gentoo/net-wireless/wpa_supplicant/wpa_supplicant-2.10-r1.ebuild configure
```

Go to the prepared source directory:

```
cd /var/tmp/portage/net-wireless/wpa_supplicant-2.10-r1/work/wpa_supplicant-2.10/wpa_supplicant
```

Edit whatever you want:

```
<e.g. change the if statement I pointed out to "if(0)">
```

Compile and install the modified source:

```
cd -; ebuild /var/db/repos/gentoo/net-wireless/wpa_supplicant/wpa_supplicant-2.10-r1.ebuild merge
```

(I use the same procedure to create the above user patches, by making a copy of the src directory prior to editing it and then running diff.)

Control Port frames are very rare, making it easy to debug with some printk statements.

In the kernel I would therefore just try this small debug patch:

```
--- linux-5.17.3-gentoo/net/mac80211/tx.c   2022-03-20 21:14:17.000000000 +0100
+++ linux-5.17.3-gentoo_patched/net/mac80211/tx.c   2022-04-17 10:41:34.148139147 +0200
@@ -5721,6 +5721,7 @@
    if (proto != sdata->control_port_protocol &&
        proto != cpu_to_be16(ETH_P_PREAUTH))
       return -EINVAL;
+   printk("DDD: CONTROL PORT FRAME\n");
 
    if (proto == sdata->control_port_protocol)
       ctrl_flags |= IEEE80211_TX_CTRL_PORT_CTRL_PROTO |
@@ -5762,9 +5763,11 @@
    if (ieee80211_lookup_ra_sta(sdata, skb, &sta) == 0 && !IS_ERR(sta)) {
       u16 queue = __ieee80211_select_queue(sdata, sta, skb);
 
+      printk("DDD: set queue = %i\n", queue);
       skb_set_queue_mapping(skb, queue);
       skb_get_hash(skb);
    }
+   printk("DDD: queue is %i\n", skb_get_queue_mapping(skb));
 
    rcu_read_unlock();
```

I would expect that queue is set to 2 by skb_set_queue_mapping here. (On my system with an iwldvm card the queue is set to zero here.) If true, you can then simply force it to zero or one. If that works, we have verified that something is wrong with the queues for your driver and can start figuring out what exactly.

edit:

You should be able to crash the kernel with the old working wpa_supplicant by sending high prio packets after you have connected.

Also not tested but according to my understanding of https://wireless.wiki.kernel.org/en/developers/documentation/mac80211/queues something like this should work:

```
iperf3 -S 0xE0 -c <some route or reachable IP>
```

But if this does not crash the kernel, we won't know whether it did not work as intended or whether the queues are not the issue...

edit2:

Turns out iperf will almost certainly not be able to crash the kernel. Looks like there is a bug in the Control Port path only.

Last edited by deagol on Mon Apr 18, 2022 11:01 am; edited 1 time in total

----------

## deagol

I had a closer look at the code and I think I found the issue. I'm not sure this is the proper fix, but it aligns control port frames with how "normal" packets are handled.

To me, that handling also looks wrong: WME and the mac80211 pull API are different things, and it currently looks like only mac80211 drivers implementing the pull API are able to correctly use WME (QoS).

Nevertheless this patch should fix the issue if my understanding of what happens is right:

```
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 6d054fed062f..072bdb5a7fe0 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -5759,7 +5759,8 @@ int ieee80211_tx_control_port(struct wiphy *wiphy, struct net_device *dev,
     */
    rcu_read_lock();
 
-   if (ieee80211_lookup_ra_sta(sdata, skb, &sta) == 0 && !IS_ERR(sta)) {
+   if (local->ops->wake_tx_queue &&
+       ieee80211_lookup_ra_sta(sdata, skb, &sta) == 0 && !IS_ERR(sta)) {
       u16 queue = __ieee80211_select_queue(sdata, sta, skb);
 
       skb_set_queue_mapping(skb, queue);
-- 
```
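The effect of the added condition can be modelled outside the kernel. This is only a sketch under assumptions: `toy_ops`, `toy_wake()` and `control_port_queue()` are made-up names, the station lookup is left out, and the queue numbers are plain inputs. The point it illustrates is that a driver without a wake_tx_queue callback keeps whatever queue mapping the skb already had.

```c
#include <stddef.h>

/* Hypothetical stand-in for the one ieee80211_ops member the patch tests. */
struct toy_ops {
	void (*wake_tx_queue)(void);
};

/* Dummy callback so a driver implementing the pull API can be modelled. */
static void toy_wake(void)
{
}

/*
 * Models the patched branch in ieee80211_tx_control_port():
 * 'selected' plays the role of the __ieee80211_select_queue() result,
 * 'current_q' the mapping already stored in the skb.
 */
static unsigned int control_port_queue(const struct toy_ops *ops,
				       unsigned int selected,
				       unsigned int current_q)
{
	if (ops->wake_tx_queue != NULL)
		return selected;   /* pull-API driver: use the selected queue */
	return current_q;          /* otherwise the mapping is left untouched */
}
```

A `struct toy_ops` with `wake_tx_queue` set to `toy_wake` models a pull-API driver; one with `NULL` models a driver like rtl818x_pci as this patch treats it.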

Once you confirm that it works, I'll prepare something for the wireless mailing list, so we can officially sort that out.

----------

## sublogic

Okay.  I'll put the kernel under revision control and try your patch.

Incidentally, "iperf3 -S 0xE0 -c <IP address>" doesn't crash.  Sorry.

----------

## sublogic

@deagol: what do you know, the patch worked !

We must be using different kernel versions.  I have 5.15.32-r1 .  I had to apply the patch 34 lines earlier than in your diff, but with a two-liner that was easy.

I'd like to follow the thread on the wireless mailing list.  Please post a link.

THANK YOU !  Good work.

----------

## deagol

Things are not as complex as I initially assumed. mac80211 (and anybody else who wants to) is allowed to set the skb priority as it sees fit; the driver simply must not select an unavailable queue based on it.

Can you undo all previous patches and test if this also fixes the issue?

```
diff --git a/drivers/net/wireless/realtek/rtl818x/rtl8180/dev.c b/drivers/net/wireless/realtek/rtl818x/rtl8180/dev.c
index 2477e18c7cae..025619cd14e8 100644
--- a/drivers/net/wireless/realtek/rtl818x/rtl8180/dev.c
+++ b/drivers/net/wireless/realtek/rtl818x/rtl8180/dev.c
@@ -460,8 +460,10 @@ static void rtl8180_tx(struct ieee80211_hw *dev,
    struct rtl8180_priv *priv = dev->priv;
    struct rtl8180_tx_ring *ring;
    struct rtl8180_tx_desc *entry;
+   unsigned int prio = 0;
    unsigned long flags;
-   unsigned int idx, prio, hw_prio;
+   unsigned int idx, hw_prio;
+
    dma_addr_t mapping;
    u32 tx_flags;
    u8 rc_flags;
@@ -470,7 +472,9 @@ static void rtl8180_tx(struct ieee80211_hw *dev,
    /* do arithmetic and then convert to le16 */
    u16 frame_duration = 0;
 
-   prio = skb_get_queue_mapping(skb);
+   /* rtl8180/rtl8185 only has one useable tx queue */
+   if (dev->queues > IEEE80211_AC_BK)
+      prio = skb_get_queue_mapping(skb);
    ring = &priv->tx_ring[prio];
 
    mapping = dma_map_single(&priv->pdev->dev, skb->data, skb->len,
-- 
```
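The ring selection in this patch can be checked in isolation. The sketch below is a user-space stand-in, not driver code: `pick_tx_ring()` is a hypothetical name, `hw_queues` plays the role of dev->queues, and `skb_queue` the role of skb_get_queue_mapping(skb). AC_BK is defined as 3, which is the value of IEEE80211_AC_BK in mac80211's access-category enum.

```c
#include <stdint.h>

/* Same value as IEEE80211_AC_BK (VO = 0, VI = 1, BE = 2, BK = 3). */
#define AC_BK 3

/*
 * Hypothetical model of the patched lines in rtl8180_tx():
 * only trust the skb's queue mapping when the hardware reports
 * more than AC_BK queues; otherwise always use ring 0.
 */
static unsigned int pick_tx_ring(unsigned int hw_queues, uint16_t skb_queue)
{
	unsigned int prio = 0;

	if (hw_queues > AC_BK)
		prio = skb_queue;
	return prio;
}
```

With `hw_queues` of 1, as the crash session earlier in the thread reported for this card, the function returns 0 whatever queue mac80211 stored in the skb, so only tx_ring[0] is ever indexed.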

----------

## sublogic

Yes, the patch in post 8701895 works fine.  THANK YOU.

(I don't see any more skb_get_queue_mapping() calls to sanitize so that should do it ?)

----------

## deagol

Patch proposed to the wireless mailing list. Any potential follow up discussion will be there.

Link to the patch discussion: https://patchwork.kernel.org/project/linux-wireless/patch/20220422145228.7567-1-alexander@wetzel-home.de/

----------

