# [SOLVED] BUG: soft lockup - CPU#0 stuck for 11s! [java:10670]

## gentunian

Hi there,

I was having a weird problem with the keyboard: the keys stopped working, all of them. The only thing that still worked was a safe restart with ctrl+alt+sysreq. At first I suspected a hacker attack (paranoia), but when I started tracking the problem down, it always happened while azureus was running. So I looked at the logs and found this:

```

Mar  1 03:09:50 chaplin BUG: soft lockup - CPU#0 stuck for 11s! [java:10670]
Mar  1 03:09:50 chaplin CPU 0:
Mar  1 03:09:50 chaplin Modules linked in: bridge llc w83627ehf hwmon_vid eeprom fuse snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss af_packet cpufreq_conservative cpufreq_ondemand cpufreq_powersave cpufreq_userspace fan button thermal powernow_k8 freq_table processor nvidia(P) ohci_hcd evdev 8250_pnp irtty_sir sir_dev irda crc_ccitt parport_pc parport ehci_hcd usbcore snd_hda_intel snd_pcm snd_timer snd snd_page_alloc 8250 serial_core psmouse pcspkr k8temp forcedeth i2c_nforce2 i2c_core unix
Mar  1 03:09:50 chaplin Pid: 10670, comm: java Tainted: P 2.6.24-gentoo #1
Mar  1 03:09:50 chaplin RIP: 0010:[<ffffffff803f8b24>] [<ffffffff803f8b24>] _spin_lock_irqsave+0x12/0x24
Mar  1 03:09:50 chaplin RSP: 0018:ffff81000bdef9c0  EFLAGS: 00000286
Mar  1 03:09:50 chaplin RAX: 0000000000000287 RBX: ffffffff805dd8a8 RCX: 000000000000000f
Mar  1 03:09:50 chaplin RDX: ffff81000bdefa60 RSI: ffff81002aca0d70 RDI: ffff81002aca0da8
Mar  1 03:09:50 chaplin RBP: ffff81003bcf5200 R08: 0000000000000064 R09: ffff810001e001c0
Mar  1 03:09:50 chaplin R10: 0000000000000002 R11: 0000000000000001 R12: 0000000000000000
Mar  1 03:09:50 chaplin R13: ffff81003ede5500 R14: 0000000000000000 R15: ffff81000bdb9770
Mar  1 03:09:50 chaplin FS:  0000000043e67950(0063) GS:ffffffff8052e000(0000) knlGS:00000000f406bb90
Mar  1 03:09:50 chaplin CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar  1 03:09:50 chaplin CR2: 00002aaaaacd3340 CR3: 0000000028c51000 CR4: 00000000000006e0
Mar  1 03:09:50 chaplin DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar  1 03:09:50 chaplin DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar  1 03:09:50 chaplin
Mar  1 03:09:50 chaplin Call Trace:
Mar  1 03:09:50 chaplin [<ffffffff802f6358>] prop_norm_percpu+0x3f/0xea
Mar  1 03:09:50 chaplin [<ffffffff802f65f7>] prop_fraction_percpu+0x3d/0x69
Mar  1 03:09:50 chaplin [<ffffffff802640ec>] get_dirty_limits+0xd0/0x184
Mar  1 03:09:50 chaplin [<ffffffff889f5788>] :fuse:fuse_dev_cleanup+0x133/0x13c
Mar  1 03:09:50 chaplin [<ffffffff8026424f>] balance_dirty_pages_ratelimited_nr+0xaf/0x2b3
Mar  1 03:09:50 chaplin [<ffffffff8025faaf>] generic_file_buffered_write+0x524/0x645
Mar  1 03:09:50 chaplin [<ffffffff80247284>] autoremove_wake_function+0x0/0x2e
Mar  1 03:09:50 chaplin [<ffffffff8020a843>] __switch_to+0x26e/0x27d
Mar  1 03:09:50 chaplin [<ffffffff8025ff0e>] __generic_file_aio_write_nolock+0x33e/0x3a8
Mar  1 03:09:50 chaplin [<ffffffff803f753c>] thread_return+0x3d/0x81
Mar  1 03:09:50 chaplin [<ffffffff8024d973>] get_futex_key+0x82/0x14e
Mar  1 03:09:50 chaplin [<ffffffff8025ffd9>] generic_file_aio_write+0x61/0xc1
Mar  1 03:09:50 chaplin [<ffffffff8025ff78>] generic_file_aio_write+0x0/0xc1
Mar  1 03:09:50 chaplin [<ffffffff8027eb58>] do_sync_readv_writev+0xc0/0x107
Mar  1 03:09:50 chaplin [<ffffffff802f77a3>] __up_read+0x13/0x8a
Mar  1 03:09:50 chaplin [<ffffffff80247284>] autoremove_wake_function+0x0/0x2e
Mar  1 03:09:50 chaplin [<ffffffff8024edb3>] do_futex+0x8a/0xa00
Mar  1 03:09:50 chaplin [<ffffffff8027e9ed>] rw_copy_check_uvector+0x6c/0xdc
Mar  1 03:09:50 chaplin [<ffffffff8027f1a1>] do_readv_writev+0xb2/0x18b
Mar  1 03:09:50 chaplin [<ffffffff802499e1>] ktime_get_ts+0x17/0x48
Mar  1 03:09:50 chaplin [<ffffffff803f7b2c>] mutex_lock+0xd/0x1e
Mar  1 03:09:50 chaplin [<ffffffff8027f696>] sys_writev+0x45/0x6e
Mar  1 03:09:50 chaplin [<ffffffff8020be2e>] system_call+0x7e/0x83

```

Before this, Azureus wasn't working because during an update I removed blackdown-sdk (it was a blocking issue) and the update installed sun-jdk-1.6.*. So when I ran azureus, a message like "vms not found" appeared. Then I did:

```
java-config -S sun-sdk

```

And azureus started working. Then this problem appeared (I never had it before). So I tried installing blackdown-sdk (1.4) and using that instead. The problem got much worse: the entire computer hung. Now I'm removing sun-sdk and recompiling azureus to test whether it happens again.

Does anyone have any idea why this could happen? The important lines for me are:

```
Mar  1 03:09:50 chaplin Pid: 10670, comm: java Tainted: P

```

and:

```
Mar  1 03:09:50 chaplin [<ffffffff889f5788>] :fuse:fuse_dev_cleanup+0x133/0x13c

```

Because the partition where I download things is NTFS, mounted with ntfs3g over fuse (built outside the kernel, fuse from portage), with ntfs3g compiled with the suid flag.

Regards,

----------

## gentunian

sun-jdk-1.6.0.03 is a dependency:

```
chaplin seba # emerge -av azureus

These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild  N    ] dev-java/sun-jdk-1.6.0.03  USE="X alsa -doc -examples -jce (-nsplugin) -odbc" 0 kB 
[ebuild   R   ] net-p2p/azureus-2.5.0.4-r1  USE="-source" 0 kB 
```

So I think I have to use sun-sdk?

----------

## irgu

From http://article.gmane.org/gmane.comp.file-systems.ntfs-3g.devel/418

 *Quote:*   

> Thanks to our Gentoo users and Miklos Szeredi, it was found out recently that the FUSE kernel module used from the FUSE software packages (Gentoo 
> ...

 

----------

## gentunian

Thanks! I'll try the kernel module.

----------

## brazso

Did using "fuse" compiled in the kernel solve your lockup problem? I'm using the latest 2.6.24-r3 (stable branch) kernel, but I still get complete system lockups while writing to an NTFS partition. First the application involved (avidemux) freezes, then soon I lose the keyboard and/or the mouse. Emerging ntfs3g still pulls in fuse 2.7.2 (the latest in the testing branch), but it says its kernel module is not built because fuse was found in the kernel. lsmod lists fuse, but I cannot tell its version. Is there a way to display the versions of loaded modules?

----------

## gentunian

Yes. Using fuse compiled in the kernel solved my lockup problem. I completely forgot about this thread; I apologize for that. The problem started when azureus was accessing an NTFS partition through the fuse module built outside the kernel, that is, by the fuse package. The fuse package complains about the module being compiled in the kernel, as you said.

My kernel version is 2.6.24-gentoo-r3 (I don't know if it is stable now; it was masked when I emerged it, but you say it's stable, so maybe it is) and the fuse version is 2.7.2. It's unrelated, but just for the record, my ntfs3g is compiled with the suid flag.

You could use modinfo to gather information about a module, but I don't think it will give you the module version information you want. I've been trying the --dump-modversions option of modprobe, but it doesn't seem to work.
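As a rough sketch of the modinfo approach: the helper below prints the file a module would be loaded from and the version it declares. The name `show_module` is made up for illustration; `-F filename` and `-F version` are regular modinfo field selectors, but a module only reports a version if it declares one, so the second line may well come back empty.

```shell
# Hypothetical helper: print where a module's file lives and its declared version.
# Assumes modinfo (from module-init-tools or kmod) is installed.
show_module() {
    mod_name="$1"
    # -F filename prints the path of the .ko file that modinfo resolves
    echo "file:    $(modinfo -F filename "$mod_name")"
    # -F version prints the version field; empty if the module declares none
    echo "version: $(modinfo -F version "$mod_name")"
}

# Example: show_module fuse
```

The path is often the more useful of the two values, since it shows which directory under /lib/modules the module comes from.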

It's weird that we have the same kernel and the same package version. Did you see the logs? Maybe it's not the same problem.

----------

## brazso

Thanks for your detailed answer, gentunian! To be frank, I found nothing suspicious in the logs after the lockups. I checked everything under the /var/log directory, especially the current dmesg output file. Do I have to activate some extra log in the kernel before? After several lockups while using avidemux (with output going to the NTFS partition), I tried manually copying a bigger file onto my NTFS partition, and it resulted in the same lockup. That is why I think ntfs-3g is responsible for the problem.

----------

## irgu

The problem is the FUSE kernel module. Either you need to use the FUSE module in the 2.6.24 kernel or the one in the FUSE 2.7.3 package. The FUSE kernel module in the FUSE 2.7.2 package is broken.

----------

## gentunian

 *Quote:*   

> Do I have to activate some extra log in the kernel before?

 

If you don't have a logging package installed, it's a good idea to install one. I use syslog-ng, simply because it's the one in the installation guide. The very first time I installed Gentoo I followed the guide line by line, and that installation has been running ever since. So, if you don't have a logger running, I recommend emerging one. To use syslog-ng, first emerge it:

```
emerge -av syslog-ng
```

and then add it to the default runlevel:

```
rc-update add syslog-ng default
```

and if you want to start logging now:

```
/etc/init.d/syslog-ng start
```

To watch your log continuously you could do something like this:

```
tail -f /var/log/messages
```

The dmesg output is not as detailed as the messages file. So, if you can reproduce the lockup, run the above command first to see instantly what's happening on your box. Since you can't use the keyboard or (sometimes) even the mouse, I recommend arranging your desktop so that a terminal showing the log stays visible.

I enabled most of the debug options I found interesting whenever I compiled the kernel. So, if you see a debug option that looks useful, go ahead and enable it. I really don't know whether my lockup got logged thanks to one of those debug options.

It's almost certain that your lockup is caused by fuse. ntfs3g uses fuse. As you can see in my log, the call trace goes through read-write operations and the fuse module shows up in the trace.

So, as I told you before, if you can reproduce the lockup, reproduce it while watching the logs, or just reproduce it, reboot, and then look at the logs.
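For the "reboot and then look at the logs" route, a small filter can help. This is only a sketch: the function name `lockup_lines` is invented here, and the default path /var/log/messages is simply the one used in this thread (other loggers may write elsewhere).

```shell
# Hypothetical helper: show only soft-lockup reports and fuse-related lines.
# Defaults to /var/log/messages, the file mentioned in this thread.
lockup_lines() {
    log_file="${1:-/var/log/messages}"
    grep -E 'soft lockup|fuse' "$log_file"
}

# Live view while reproducing the problem (stop with Ctrl+C):
#   tail -f /var/log/messages | grep -E --line-buffered 'soft lockup|fuse'
```

Running `lockup_lines` after a reboot prints any "BUG: soft lockup" entries together with the fuse lines around them, if the logger managed to write them before the machine froze.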

Note: To safely reboot your computer without pressing the reset button, you could use the SysRq key. The SysRq key is the same as the Print Screen key. So, if you hold both the ALT and SysRq keys, then in combination with the keys below you can:

 R - take the keyboard out of raw mode

 E - terminate all processes

 I - kill all processes

 S - sync disks

 U - remount filesystems read-only

 B - reboot

Pressing them in the above order restarts the computer safely. Remember to hold all three keys each time, that is, alt + sysrq + r to take the keyboard out of raw mode, alt + sysrq + e to terminate all processes, and so on. If the keys don't work, support for them needs to be compiled into the kernel. You can check this by doing:

```
grep MAGIC <your-config-kernel-file>
```

and if you can see this:

```
CONFIG_MAGIC_SYSRQ=y
```

then the feature is enabled. If not, you could edit the config file, find that line (something like "CONFIG_MAGIC_SYSRQ is not set"), change it to the above and recompile the kernel, or you could do it the traditional way using make menuconfig, in the "kernel hacking" section.
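The check above can be wrapped in a small function. This is a sketch only: `sysrq_config_state` is a made-up name, and the example path /usr/src/linux/.config assumes the usual symlink to the kernel sources.

```shell
# Hypothetical helper: report whether Magic SysRq is enabled in a kernel config file.
sysrq_config_state() {
    config_file="$1"
    if grep -q '^CONFIG_MAGIC_SYSRQ=y' "$config_file"; then
        echo "enabled"
    else
        # Covers both "# CONFIG_MAGIC_SYSRQ is not set" and a missing line
        echo "disabled"
    fi
}

# Example, assuming the kernel sources are symlinked at /usr/src/linux:
#   sysrq_config_state /usr/src/linux/.config
# On a running system, /proc/sys/kernel/sysrq holds the runtime setting
# (0 means the keys are switched off even when compiled in).
```

Even with the option compiled in, it is worth looking at the runtime value, since a 0 there would also explain keys that do nothing.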

EDIT:  *irgu wrote:*   

> The problem is the FUSE kernel module. Either you need to use the FUSE module in the 2.6.24 kernel or the one in the FUSE 2.7.3 package. The FUSE kernel module in the FUSE 2.7.2 package is broken.

 

Sorry if I'm wrong, but what I understood from brazso's post is that he compiled the FUSE module in kernel 2.6.24 and still has the problem.

----------

## brazso

gentunian> I installed Gentoo on my desktop machine at least 2 years ago, so I did not remember which log package was active. I checked: it's metalog. I think everything is switched on in its configuration file, and I have a lot of files under /var/log, e.g. the "messages" file that you mentioned. However, I found nothing in the log files that might be related to the lockups. Following your advice, I will try enabling all the debug facilities in the kernel, and I will also try writing to NTFS with my previous kernel (2.6.23). Thanks for the detailed description of how to use the SysRq key, I always learn something new.

I have just noticed irgu's comment. In theory I'm using the fuse module compiled from kernel 2.6.24-r3. The command modinfo fuse confirms that fuse.ko is used and loaded from the current kernel's path. However, the dmesg output contains a message saying fuse 2.7.2 is the one loaded. I will try a later kernel from the testing branch; I'm curious to know which fuse version is used there.

----------

## gentunian

brazso, checking the log for all fuse lines (e.g. "cat /var/log/messages | grep fuse"), I found that since I compiled the FUSE module in the kernel, these lines:

```
Mar  3 18:21:30 chaplin fuse init (API version 7.8)
Mar  3 18:21:30 chaplin fuse distribution version: 2.7.2
```

are gone. If you look at the date of this thread, it's consistent with this: after "Mar 3", "fuse distribution version: 2.7.2" is no longer shown. So I deduce that the message came from the FUSE module built by the fuse package. You can check that too, to see whether you have the module from the package or the one built with the kernel. Maybe you have both, and modprobe is still inserting the wrong one, I mean, the one you don't want.

This is the line that has appeared since the recompilation, up to now:

```
Mar  3 19:08:32 chaplin fuse init (API version 7.9)
```
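That deduction can be turned into a quick check. The sketch below is built on the observation in this thread only (the package-built module logs a "fuse distribution version" line, the kernel's own module just logs "fuse init"); the function name `fuse_module_origin` and its labels are invented for illustration, not anything fuse itself provides.

```shell
# Hypothetical helper: guess which fuse module was loaded most recently,
# judging by the last fuse-related line in the log.
fuse_module_origin() {
    log_file="${1:-/var/log/messages}"
    awk '
        # Every load prints "fuse init"; assume in-kernel until proven otherwise
        /fuse init \(API version/    { seen = 1; origin = "in-kernel" }
        # Seen in this thread only for the module built by the fuse package
        /fuse distribution version:/ { seen = 1; origin = "package" }
        END { if (seen) print origin; else print "unknown" }
    ' "$log_file"
}

# Example: fuse_module_origin /var/log/messages
```

If the result is "package" although the kernel's fuse module was built, a second fuse.ko is probably still lying around under /lib/modules.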

Regarding the log level: I looked through my kernel configuration for the debug options I have enabled. In the "kernel hacking" section I've activated "Kernel debug", and a bit below it there's one called "Detect soft lockups". You should enable that, compile the kernel and try. Have a look at the other options too, but I think those two should be enough.

Cheers,

----------

## brazso

gentunian> After system boot, dmesg now displays exactly the same fuse info (7.8, 2.7.2) that you saw before you switched to the kernel's module. I'm afraid you are right and the kernel still uses the module from the fuse package, even though fuse is configured as a module in the kernel and modinfo suggests otherwise. Did you remove the emerged fuse package? Several packages (ntfs3g is one of them) depend on it, so in my case it cannot be removed. I can still try building it into the kernel (=y) instead of as a module (=m). You have the same kernel version as mine (2.6.24-r3). Could you tell me the size of the fuse.ko file that modinfo fuse points to? It shouldn't be different, but who knows :) I will check the logs again after turning kernel debugging on, thanks!

----------

## gentunian

Well, if you have two modules with the same name, you can locate them using slocate, or check the /lib/modules directory. Also, it's safe to unmerge the fuse package, compile the kernel module, and then emerge the fuse package again.

BTW, modinfo tells you the path to the module. That doesn't mean the module was built from the kernel source. So, if your slocate database is up to date, you could look for the fuse module with:

```
slocate fuse.ko
```

If in doubt, remove what you find (making a backup if you want). Then compile the fuse kernel module. You could build it into the kernel, but I prefer building things as modules, so I recommend that too.

If your slocate database isn't up to date, you can update it by running, as root:

```
slocate -u
```

If you don't want to use slocate, you can search your /lib/modules directories using find, remove (or move, back up, or whatever you want) the fuse.ko files, and then compile the module again (from the kernel).
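A minimal sketch of the find variant, with an invented helper name (`find_module_copies`). It defaults to the directory of the running kernel and only lists files; it does not delete anything.

```shell
# Hypothetical helper: list every copy of a module under a modules directory.
# More than one line of output means duplicate modules are installed.
find_module_copies() {
    mod_name="$1"
    modules_root="${2:-/lib/modules/$(uname -r)}"
    # The trailing * also matches compressed modules such as fuse.ko.gz
    find "$modules_root" -type f -name "$mod_name.ko*" | sort
}

# Example: find_module_copies fuse
```

Anything listed outside the kernel/ subdirectory is a candidate for the copy left behind by a package rather than the kernel build.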

I don't remember the exact procedure I followed, but I'm fairly sure that I unmerged the fuse package, compiled the kernel, rebooted into the new kernel, and then emerged the fuse package. (Rebooting matters because sometimes you can't unload a module, so the old one stays loaded when you want it gone. Alternatively, the "mount" command shows what is using fuse; unmount that, and then you should be able to remove the old module.)
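To see what is using fuse without reading the whole mount output, something like the following could be used. It is a sketch under assumptions: `fuse_mounts` is a made-up name, and it reads /proc/mounts and treats any filesystem type starting with "fuse" (for example fuse, fuseblk or fusectl) as a fuse mount.

```shell
# Hypothetical helper: list mount points whose filesystem type starts with "fuse".
fuse_mounts() {
    mounts_file="${1:-/proc/mounts}"
    # Fields in /proc/mounts: device, mount point, filesystem type, options, ...
    awk '$3 ~ /^fuse/ { print $2 " (" $3 ")" }' "$mounts_file"
}

# Example: fuse_mounts
```

Once every mount point it lists has been unmounted, unloading the old module (for instance with `modprobe -r fuse`) should no longer be refused for being in use.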

Good luck! And don't forget to tell us how it went.

----------

## brazso

The suggested slocate command solved the problem: it showed two fuse.ko files under /lib/modules/2.6.24-gentoo-r3.

```
slocate fuse.ko
/lib/modules/2.6.24-gentoo-r3/fs/fuse/fuse.ko
/lib/modules/2.6.24-gentoo-r3/kernel/fs/fuse/fuse.ko

```

The first one belonged to the fuse package. It seems that removal of the package did not erase its fuse.ko. I was so lucky that the system always chose the first one instead of the kernel one. I removed the first one manually, and I have checked that a new emerge sys-fs/fuse does not create it again. Now I get the expected version of fuse in the log, and the soft lockup has vanished.

```
dmesg | grep fuse
fuse init (API version 7.9)
```

Thanks for the great help, gentunian! I think you can set the status of this topic to solved.

----------

## gentunian

 *brazso wrote:*   

> ... It seems that removal of the package did not erase its fuse.ko.

 

That's probably because the module was in use, like I said before.

I'm glad you solved your problem! The real help came from irgu, who let us know about the bug; you should thank him.

Regards,

----------

