# Kernel fails to resume from suspend

## Bircoph

Hello,

After updating from kernel-3.5.7 to 3.9.9, resume fails with the following message:

 *Quote:*   

> Processes could not be frozen, cannot continue resuming.
> 
>     You can now boot the system and lose the saved state
> ...

 

This message is printed by the resume application from the initrd image.

I use vanilla kernels and sys-power/suspend-1.0.

kernel-3.9.9 config: http://bpaste.net/show/113214/

diff between 3.5.7 and 3.9.9: http://bpaste.net/show/113215/

Any clues?

Update:

Digging into suspend-utils code shows that the following ioctl fails on "/dev/snapshot":

```
ioctl(dev, _IO('3', 1), 0);  /* SNAPSHOT_FREEZE; the ioctl magic is the character '3' */
```

Have no idea why, though.

Update 2:

I disabled CONFIG_LOCKUP_DETECTOR because I suspected it of creating unfreezable processes, but with no luck.

Another guess was that the userspace is too old for this kernel (the last full update was in October 2012), so I updated linux-headers, glibc and all of @system to the latest ~x86 versions and rebuilt suspend. No luck either.

Update 3:

I modified the suspend code to print errno. The freeze ioctl on /dev/snapshot fails with:

```

Error 11: Resource temporarily unavailable

```

My guess is that either the freeze ioctl itself is broken or it is misused because of some API change in the kernel.

----------

## Hu

Does v3.10 work?  What was the first kernel release to fail?  Did this regress as part of a stable series or in a major release?

----------

## Bircoph

 *Hu wrote:*   

> Does v3.10 work?  What was the first kernel release to fail?  Did this regress as part of a stable series or in a major release?

 

I have not tested kernel versions other than 3.5.7 and 3.9.9 yet. I will do so in a while. This host has an Atom N270 CPU and takes a long time to compile a kernel. To make it worse, I'm leaving on a trip tomorrow with this messed-up system, and it is likely that I won't be able to test it properly for the next two weeks.

For now I tried suspend from the latest git. This required some patching in order to build the package, though the changes were trivial.

The results are disastrous: in the latest git even s2ram is broken, the system wakes up immediately after being put to sleep. I can bisect which change causes this issue, but this will take time I don't have now.

Another problem is that I have absolutely no alternative to the suspend package: I need hibernation with image encryption support. The only other project I know of that is capable of this is tuxonice, but it requires a heavily patched kernel and I can't afford that.

Meanwhile I'll try to double-check the kernel.

1. On each major kernel update I reconfigure the kernel thoroughly (to account for new features and to meet the requirements of my userland tasks). So, to eliminate the possibility of a misconfigured kernel on my side, I'll use kernels configured (mostly) with make silentoldconfig during testing; this applies to 3.9.9 as well.

2. I'll try to bisect the kernel version that causes this.

3. If that succeeds, I'll try to find the kernel patch causing this issue, though that will be hard to impossible if the breakage occurred during a major update. I think LKML is my friend then.

Update:

make silentoldconfig is not enough because the kernel changes are too extensive, so I'm using make olddefconfig.
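Carrying a known-good config into a newer source tree can be sketched roughly as below. This is only a minimal sketch: `port_config` is a hypothetical helper name, and the `MAKE` override is an assumption added so that the function can be tried without a real kernel tree.

```shell
# Hypothetical helper: carry an old kernel .config into a newer source tree.
# Usage: port_config /path/to/old/.config /path/to/new/linux-tree
port_config() (
    old_config=$1
    tree=$2
    [ -f "$old_config" ] || { echo "no such config: $old_config" >&2; exit 1; }
    [ -d "$tree" ] || { echo "no such tree: $tree" >&2; exit 1; }
    cp "$old_config" "$tree/.config" || exit 1
    # olddefconfig keeps the answers already present in .config and takes
    # the default value for every new symbol without prompting.
    "${MAKE:-make}" -C "$tree" olddefconfig
)
```

Using the same starting config for every tested version keeps configuration differences out of the comparison as far as possible.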

----------

## Bircoph

Hi,

I tried 3.9.9 with default options (olddefconfig on the 3.5.7 config): the problem is still here. With 3.10.0 and 3.7.10 it still fails.

So the problem is quite old and lies somewhere between 3.5 and 3.7. I wonder why I'm the only one to hit this issue: suspend is not a rarely used application, and the affected kernels have been out in the wild for a long time. Maybe this is a hardware-related issue. The device is an EeePC 1000H with memory upgraded to 2GB and the HDD replaced with a 750GB Seagate drive.

----------

## Hu

I have used suspend successfully on kernels where it failed for you, so this is not a general suspend failure.  Is there anything in dmesg at the time of the failure?

----------

## jimmij

 *Bircoph wrote:*   

> kernel-3.9.9 config: http://bpaste.net/show/113214/

 

Try setting "Default resume partition" under "Power Management and ACPI options" (PM_STD_PARTITION).

----------

## Bircoph

Hello,

 *Hu wrote:*   

> I have used suspend successfully on kernels where it failed for you, so this is not a general suspend failure.
> 
> 

 

Yeah, it seems to be related to either the hardware or my kernel config.

 *Hu wrote:*   

> Is there anything in dmesg at the time of the failure?

Nothing out of order; the PM messages are the same as for the working 3.5.7 version.

But you gave me the right idea: I can debug this using CONFIG_PM_DEBUG as described in Documentation/power/basic-pm-debugging.txt
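The test modes from that document can be driven through /sys/power/pm_test. The following is a minimal sketch, not the exact procedure used in this thread: `pm_test_run` is a hypothetical helper name, and its second argument (an alternative sysfs root) is an assumption that only exists to allow a dry run against ordinary files.

```shell
# Hypothetical helper: run one hibernation test level from
# Documentation/power/basic-pm-debugging.txt (needs CONFIG_PM_DEBUG).
# The optional second argument is an assumption for dry runs; default is /sys.
pm_test_run() (
    level=$1
    sysfs=${2:-/sys}
    case $level in
        freezer|devices|platform|processors|core) ;;
        *) echo "unknown test level: $level" >&2; exit 1 ;;
    esac
    # Select the test level, then start a hibernation cycle. In test mode the
    # kernel stops at that level, waits a few seconds and resumes.
    echo "$level" > "$sysfs/power/pm_test" || exit 1
    echo disk > "$sysfs/power/state"
    status=$?
    # Switch test mode off again so the next hibernation is a real one.
    echo none > "$sysfs/power/pm_test"
    exit "$status"
)
```

Run as root, `pm_test_run freezer` exercises only the freezing of tasks, which is the step that fails here from the userspace resume tool.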

*****************

Meanwhile I bisected the kernel a bit further: 3.6.11 works fine, while with 3.7.0 I hit a second, different failure: the kernel (or the resume application) hangs before it starts the resume:

```
RAMDISK: gzip image found at block 0
EXT4-fs (ram0): couldn't mount as ext3 due to feature incompatibilities
EXT4-fs (ram0): couldn't mount as ext2 due to feature incompatibilities
EXT4-fs (ram0): mounted filesystem without journal. Opts: (null)
VFS: Mounted root (ext4 filesystem) on device 1:0
```

The funny thing is that the kernel is still able to reboot on Ctrl-Alt-Del. The EXT4-fs messages are normal: I have no ext2 or ext3 modules, but the ext4 module is configured to support both ext2 and ext3 (this saves space), so it tries the old versions first.

3.7.10 with the same kernel config is able to proceed further: it loads the image, but then fails to freeze processes as described in the first post.

The latest 3.10.5 still fails (original failure).

Also I tried to disable watchdog subsystem in the kernel (assuming it can be a culprit for a failed freeze), but with no luck.

----------

## Bircoph

Hi,

 *jimmij wrote:*   

>  *Bircoph wrote:*   
> 
> kernel-3.9.9 config: http://bpaste.net/show/113214/
> 
>  
> ...

 

That doesn't help and as you can see from my original post, suspend image is found and loaded, but some processes can't be frozen, thus thaw fails.

----------

## Bircoph

 *Hu wrote:*   

> I have used suspend successfully on kernels where it failed for you, so this is not a general suspend failure.

 

Did you use the s2disk tool or some other means to suspend?

I found a quite astonishing result: when I suspend to disk manually via the kernel interface as described in Documentation/power/basic-pm-debugging.txt, resume works fine. What I did:

1) I completely disabled the s2disk resume utility (removed the resume initrd image from the kernel arguments and used the usual resume option instead).

2) # echo disk > /sys/power/state

So the problem is either in the resume application (/usr/lib/suspend/resume, which is copied to the initrd) or in the kernel support for userspace suspend. All suspend tests proposed in the kernel docs work fine.

Assuming the bug is caused by extra functionality in s2disk's resume, I disabled image encryption, compression, early writeout and threading, and also removed the threadirqs kernel option. No luck again.
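Before trying the manual test from steps 1) and 2), it may help to check that the in-kernel path is usable at all. The sketch below is an assumption-laden illustration: `kernel_hibernate_ready` is a hypothetical helper, both file arguments exist only for dry runs, and it deliberately ignores the case where a default resume partition is compiled in via PM_STD_PARTITION.

```shell
# Hypothetical pre-flight check before `echo disk > /sys/power/state`.
# Both file arguments are assumptions for dry runs; they default to the
# real /sys/power/state and /proc/cmdline.
kernel_hibernate_ready() (
    state_file=${1:-/sys/power/state}
    cmdline_file=${2:-/proc/cmdline}
    # The kernel lists "disk" in /sys/power/state when hibernation is available.
    tr ' ' '\n' < "$state_file" | grep -qx disk \
        || { echo "hibernation is not offered by this kernel" >&2; exit 1; }
    # The in-kernel resume code looks for the image on the resume= device
    # (a compiled-in PM_STD_PARTITION default is not considered here).
    tr ' ' '\n' < "$cmdline_file" | grep -q '^resume=' \
        || { echo "no resume= on the kernel command line" >&2; exit 1; }
)
```

If the check passes, `echo disk > /sys/power/state` as root starts the kernel-driven hibernation without involving s2disk or its resume binary.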

----------

## Hu

Yes, I used s2disk successfully.

----------

## Bircoph

I found a recent LKML thread with somewhat similar but not the same bug: http://www.spinics.net/lists/linux-nfs/msg38160.html . I tried the latest patch proposed there on top of 3.10.5: http://www.spinics.net/lists/linux-nfs/binFIQj2w6Yy9.bin , but with no luck again.

ATM I'm trying to bisect both failures (maybe the second change point is an inadequate fix for the first bug). This will take a long time. In theory I could automate the bisection, but developing and testing an automation system may take longer than the bisection itself. The second point should be easier to find because the differences between patch versions of the same branch are smaller, and it may contain a hint to the first point.
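For reference, the automated form of a bisection can be sketched with `git bisect run`. This is only a sketch under assumptions: `bisect_between` is a hypothetical helper name, and the test command is whatever can decide good or bad unattended. For this particular bug each step needs a kernel build, a reboot and a hibernation cycle, so in practice the verdict would be given by hand with `git bisect good` and `git bisect bad`.

```shell
# Hypothetical helper: bisect between a known-good and a known-bad revision,
# letting a test command decide each step. The command must exit 0 for a good
# revision and with 1-127 (except 125, which means "skip") for a bad one.
# Usage: bisect_between <repo> <good> <bad> <test command...>
bisect_between() (
    repo=$1 good=$2 bad=$3
    shift 3
    cd "$repo" || exit 1
    git bisect start "$bad" "$good" >/dev/null || exit 1
    if git bisect run "$@" >/dev/null; then
        # After a finished run, refs/bisect/bad names the first bad commit.
        git rev-parse refs/bisect/bad
        status=0
    else
        status=1
    fi
    git bisect reset >/dev/null 2>&1
    exit "$status"
)
```

The manual equivalent is `git bisect start <bad> <good>`, followed by a build, a test, and one `git bisect good` or `git bisect bad` per step until git reports the first bad commit.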

----------

## Bircoph

Hello, some good news here.

Kernel bisection was successful (perhaps I should have used it right from the start: it saves a lot of time compared to emerge kernel-version && make ...).

Commit which causes both bugs is:

```
commit ba4df2808a86f8b103c4db0b8807649383e9bd13
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Tue Oct 2 15:29:10 2012 -0400

    don't bother with kernel_thread/kernel_execve for launching linuxrc

    exec_usermodehelper_fns() will do just fine...

    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```

Looks like exec_usermodehelper_fns() doesn't do just fine :)

Second bug was fixed by commit:

```
commit f0de17c0babe7f29381892def6b37e9181a53410
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Sat Jan 19 13:29:54 2013 -0500

    make sure that /linuxrc has std{in,out,err}

    commit 43b16820249396aea7eb57c747106e211e54bed5 upstream.

    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Barry Davis <Barry-Davis@stormagic.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
```

No wonder linuxrc was hanging: the poor resume application had no access to the standard I/O streams.

The problem now is that reverting ba4df2808a86f8b103c4db0b8807649383e9bd13 is nontrivial: too many changes have been made since, and if I revert all of them, kernel compilation breaks.

At least now I know where the problem lies. What is left is to find out how to fix it. Perhaps LKML is my friend here.

----------

## TomWij

 *Quote:*   

> -	pid = kernel_thread(do_linuxrc, "/linuxrc", SIGCHLD);
> 
> -	if (pid > 0)
> 
> -		while (pid != sys_wait4(-1, NULL, 0, NULL))
> ...

 

Revert just this change in the second function; if it does not compile, you'll need to fork the functions.

----------

## Bircoph

Hello,

 *TomWij wrote:*   

> 
> 
> Revert just this change in the second function; if it does not compile, you'll need to fork the functions.

 

Reverting is not enough, because the do_linuxrc function is needed for the call, and it requires the global variables old_fd and root_fd, which in turn require other code changes...

So I went a somewhat different way: I forward-ported the pre-failure code and fixed the compilation failure caused by an internal API change: kernel_execve was replaced by sys_execve. That's all, and it was quite simple ;) The patched kernel works fine for me! I finally did it! Thank you all for the help. The patch is available here: http://bpaste.net/show/122288/

Now I'm going to refine all this and report it on LKML, probably on linux-kernel with CC to linux-pm and to the developers of both uswsusp and the kernel patch in question. Of course the current patch is just a workaround, and a proper fix is yet to be found by the devs. Anyway, my kernel works fine again! Hurray!

P.S. I'm not closing this bug until a proper fix is committed to the kernel.

----------

## TomWij

This is just a forum topic and not a bug, if you can then please take it to https://bugs.gentoo.org and https://bugzilla.kernel.org such that this can be tracked; on LKML this might not be responded to under the form of a full revert or a bug, unless they are more in the form of a discussion patch. So, you'll want to try smaller patches to see which line makes the actual difference between working and being broken; so, that's why I suggested to just test that small bit. Of course you'll need to forward port what's missing, but at least you don't do a full revert that way. So, if you can, please do so such that we have more success on getting this patched upstream; if not, please file the bug such that we (the Gentoo Kernel team) can track this and then we can try writing some more minimal patches for you to try.

----------

## Bircoph

 *TomWij wrote:*   

> This is just a forum topic and not a bug, if you can then please take it to https://bugs.gentoo.org

 

And the first thing I'll be asked to do will be to try with Gentoo-sources :)

If you want, I can duplicate LKML bugreport on Gentoo bugzilla of course.

 *Quote:*   

> 
> 
> and https://bugzilla.kernel.org such that this can be tracked; on LKML this might not be responded to under the form of a full revert or a bug, unless they are more in the form of a discussion patch.

 

This is strange. Less than a year ago I reported a bug on kernel's bugzilla and was asked to write to LKML: https://bugzilla.kernel.org/show_bug.cgi?id=48081 . And the problem was finally solved! Though bug was very difficult to reproduce and I had not the slightest idea what part of the kernel is messed up.

 *Quote:*   

> 
> 
> So, you'll want to try smaller patches to see which line makes the actual difference between working and being broken; so, that's why I suggested to just test that small bit. Of course you'll need to forward port what's missing, but at least you don't do a full revert that way. So, if you can, please do so such that we have more success on getting this patched upstream; if not, please file the bug such that we (the Gentoo Kernel team) can track this and then we can try writing some more minimal patches for you to try.

 

I agree with you completely. The only problem is that I can't see a way how to split commit ba4df2808a86f8b103c4db0b8807649383e9bd13 to smaller parts.

I'll try to analyze following commits tomorrow. Maybe patch itself may be made smaller, but only just for a bit. Either way I believe that proper solution will be to not revert it at all and to fix what is broken in the original commit, it is also possible that nothing was broken and this change just triggered a dormant bug somewhere in another code.

LKML report is here: http://marc.info/?l=linux-kernel&m=137633669228353

----------

## TomWij

 *Bircoph wrote:*   

> And the first thing I'll be asked to do will be to try with Gentoo-sources

 

No, we won't ask you that, because we don't patch this piece of code. What we instead do is ask you to run the latest kernels to see if it has already been fixed.

 *Bircoph wrote:*   

> If you want, I can duplicate LKML bugreport on Gentoo bugzilla of course.

 

Yes, feel free to do so; basically, we want to be aware of bugs so we know which releases have certain bugs our users are experiencing and which might not. Certain bugs can block stabilization of a kernel; so, if such a bug is missing we might miss its existence and stabilize regardless.

 *Bircoph wrote:*   

> This is strange. Less than a year ago I reported a bug on kernel's bugzilla and was asked to write to LKML: https://bugzilla.kernel.org/show_bug.cgi?id=48081 . And the problem was finally solved! Though bug was very difficult to reproduce and I had not the slightest idea what part of the kernel is messed up.

 

That was a specific list, which is different from the main list. While you can try to figure out which specific list might help you more, you might want to consider filing a bug instead of posting to the high-volume main kernel list.

 *Bircoph wrote:*   

> I agree with you completely. The only problem is that I can't see a way how to split commit ba4df2808a86f8b103c4db0b8807649383e9bd13 to smaller parts.

 

Hmm, okay. When I get the time, I will try to revert this specific bit myself and write a more minimal patch.

 *Bircoph wrote:*   

> I'll try to analyze following commits tomorrow. Maybe patch itself may be made smaller, but only just for a bit. Either way I believe that proper solution will be to not revert it at all and to fix what is broken in the original commit, it is also possible that nothing was broken and this change just triggered a dormant bug somewhere in another code.

 

Yeah, could very well be that it is from different code; but we probably need to first understand which specific part fails to get a clue about what the actual problem really is.

 *Bircoph wrote:*   

> LKML report is here: http://marc.info/?l=linux-kernel&m=137633669228353

 

Marked in my mail client to watch this thread; that way, you don't need to explicitly CC us. I see you have CC-ed the relevant person and list as well, so you have increased your chances of getting a reply. Okay, let's see if this works out before we file a bug at the upstream kernel bugzilla.

----------

## toralf

 *Bircoph wrote:*   

> (perhaps I should have used it right from the start: it saved a lot of time compared to emerge kernel-version && make ...)

 I switched from the ebuild to the git tree too, for various reasons. You just have to add a line to /etc/portage/profile/package.provided with an appropriate/arbitrary kernel version, that's all.

----------

## Bircoph

Hello,

 *TomWij wrote:*   

>  What we instead do is ask you to run the latest kernels to see if it has already been fixed.

 

At least with 3.10.7 and 3.11-rc5 the problem is still here.

 *TomWij wrote:*   

> Yes, feel free to do so; basically, we want to be aware of bugs so we know which releases have certain bugs our users are experiencing and which might not. Certain bugs can block stabilization of a kernel; so, if such a bug is missing we might miss its existence and stabilize regardless.

 

Done. See bug 481344.

 *Quote:*   

> Yeah, could very well be that it is from different code; but we probably need to first understand which specific part fails to get a clue about what the actual problem really is.

 

It looks like the problem is in call_usermodehelper_fns(); everything else in ba4df2808a86f8b103c4db0b8807649383e9bd13 is just I/O and wrappers.

 *TomWij wrote:*   

> Marked in my mail client to watch this thread; that way, you don't need to explicitly CC us. I see you have CC-ed the relevant person and list as well; so, you increased your chances on getting reply. Okay, let's see if this works out before we file a bug at upstream kernel bugzilla.

 

Yeah, the problem with the current bug is that there is no appropriate list for it. I CC'ed linux-pm because a PM tool is affected, but the real problem is not in the tool but somewhere in the kernel threading/freezing code. Hopefully the author of the original commit can help. If there is no reply within a week of the initial mail, I'll open a bug on the kernel's bugzilla.

----------

## Bircoph

Hello,

 *toralf wrote:*   

> I switched from the ebuild to the git tree too - for various reasons. You just have to add a line to /etc/portage/profile/package.provided with an appropriate/arbitrary kernel version - that's all.

 

And how do you handle linux headers? If you use sys-kernel/linux-headers, you may be out of sync from your custom kernel. And you can't just do make headers_install because this will collide with glibc headers.

----------

## toralf

 *Bircoph wrote:*   

> Hello,
> 
>  *toralf wrote:*   I switched from the ebuild to the git tree too - for various reasons. You just have to add a line to /etc/portage/profile/package.provided with an appropriate/arbitrary kernel version - that's all. 
> 
> And how do you handle linux headers? If you use sys-kernel/linux-headers, you may be out of sync from your custom kernel. And you can't just do make headers_install because this will collide with glibc headers.

 It looks like an installed vanilla kernel, doesn't it?

```
tfoerste@n22 ~ $ eix -I sys-kernel/linux-headers
[I] sys-kernel/linux-headers
     Available versions:  2.4.33.3^bs ~2.4.36^bs 3.1^bs ~3.2-r1^bs ~3.3^bs 3.4^bs ~3.4-r1^bs ~3.4-r2^bs ~3.5^bs 3.6^bs 3.7^bs ~3.8^bs ~3.9^bs
     Installed versions:  3.7^bs(06:18:43 PM 07/16/2013)
     Homepage:            http://www.kernel.org/ http://www.gentoo.org/
     Description:         Linux system headers

tfoerste@n22 ~ $ cat /etc/portage/profile/package.provided
#       package.provided
#
sys-kernel/vanilla-sources-3.7.9
```
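Adding such an entry can be sketched as follows. This is a minimal sketch, assuming the vanilla-sources package name from the listing above: `provide_kernel` is a hypothetical helper name, and its optional file argument exists only so that it can be tried on a scratch file instead of the real profile.

```shell
# Hypothetical helper: tell Portage that a kernel source tree managed outside
# of it (e.g. a git checkout) counts as installed. The optional file argument
# is an assumption for dry runs; it defaults to the path shown above.
provide_kernel() (
    version=$1
    file=${2:-/etc/portage/profile/package.provided}
    entry="sys-kernel/vanilla-sources-$version"
    mkdir -p "$(dirname "$file")" || exit 1
    # Append the entry only if that exact line is not there yet.
    grep -qxF "$entry" "$file" 2>/dev/null || echo "$entry" >> "$file"
)
```

For example, `provide_kernel 3.7.9` as root would produce the last line of the file shown above.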

----------

## Hu

You can have linux-headers newer relative to your kernel when you use the Gentoo-provided kernel sources, too.  The package manager does not force you to build or boot any of the installed kernels.

----------

## toralf

Just FWIW, Peter Hurley from the stable kernel ML pointed me to your thread on LKML. I'm suffering from an s2disk issue too. It's erratic here, seems to be present already in 3.10 itself and not just in 3.10.x, and it is annoying. I'm still investigating on my ThinkPad T420.

----------

## Bircoph

 *toralf wrote:*   

> It looks like an installed vanilla kernel, doesn't it?

 

I understand the package.provided trick, it's not an issue. The problem is that if the kernel headers are too different from the running kernel, nasty problems will appear, e.g. I had bug 263497 some time ago. And the Gentoo kernel team is sometimes behind the vanilla kernel with linux-headers.

Another problem is that linux-headers is used during the glibc build (and, strictly speaking, by some other packages too), and if glibc was built with kernel headers older than the running kernel, it may not utilize all features of the running kernel.

 *Hu wrote:*   

> You can have linux-headers newer relative to your kernel when you use the Gentoo-provided kernel sources, too.  The package manager does not force you to build or boot any of the installed kernels.

 

Yes I can, but this leads to big troubles: in bug 263497 I had exactly the same situation: old kernel and new linux-headers.

----------

## Bircoph

 *toralf wrote:*   

> Just FWIW Peter Hurley from stable kernel ML pointed me to your thread at the LKML - I'm suffering from a s2disk issue too - its erratic here, seems to be already in 3.10 itself and not just in 3.10.x and it is annoying - still investigating here at my ThinkPad T420.

 

Do you have the same trouble with resume or something else? Also please note that the bug I'm suffering from appeared in 3.7_rc1. And take a look at this related bug: http://www.spinics.net/lists/linux-nfs/msg38160.html Maybe the proposed solution will help you.

Anyway, it's good to hear that the bug has been noticed on LKML.

----------

## toralf

 *Bircoph wrote:*   

>  *toralf wrote:*   Just FWIW Peter Hurley from stable kernel ML pointed me to your thread at the LKML - I'm suffering from a s2disk issue too - its erratic here, seems to be already in 3.10 itself and not just in 3.10.x and it is annoying - still investigating here at my ThinkPad T420. 
> 
> Do you have the same trouble with resume or something else? Also please note that bug I'm suffering from appeared in 3.7_rc1. And take a look on this related bug: http://www.spinics.net/lists/linux-nfs/msg38160.html Maybe proposed solution will help you.
> 
> Anyway its good to hear that bug is noticed on LKML.

 I read that and tried reverting all the commits mentioned there on top of 3.10.7, with no success. My current attempt to bisect between 3.9 and 3.10 was not very successful. As an upper limit I'm now at commit 7da052b. That's why I'm wondering whether it could be a long-standing hidden bug/feature in the kernel that was just triggered by a userspace tool change here in Gentoo land. But I do use a straight vanilla kernel and suspend just via a self-written script doing "echo 1 > /sys/power/state", so that's not very likely, is it?

And I used s2disk w/o any problems in 3.8.x and 3.9.x AFAICR. What makes bisecting worse is that I need to check every commit 3 times or more to trust "git bisect good".

Last edited by toralf on Sun Aug 18, 2013 5:29 pm; edited 1 time in total

----------

## Hu

Using an old linux-headers with a new kernel deprives you of features, but is safe.  Using a new linux-headers with an old kernel should work, but as you say, it sometimes exposes bugs.  However, my point was that you have all these risks even if you use the Gentoo-provided kernel ebuilds, because nothing ensures that you will actually build and boot a kernel even once the Gentoo kernel team marks it as stable.  It is always the responsibility of the system administrator to keep the running kernel current, regardless of how the sources for that kernel are obtained.

----------

## toralf

I don't think that my issue is related to linux-headers. Currently I'm convinced that the issue is somehow related to X11. BTW I stopped bisecting in favour of looking into userspace tools.

Currently I'm arguing about the xorg server and MESA.

----------

## Bircoph

Hello,

After two weeks of silence on LKML I checked that it is still an issue with 3.11-rc7/3.10.9, bumped the mailing list thread and created kernel bug 60802. I hope this will attract some attention.

toralf: Have you fixed your problem?

----------

## toralf

 *Bircoph wrote:*   

> Hello,
> 
> after two weeks of silence on LKML I checked that it is still an issue with 3.11-rc7/3.10.9, bumped mail list thread and created a kernel bug 60802. I hope this will attract some attention.
> 
> toralf: Have you fixed your problem?

 no - unfortunately not.

And just applying your patch (https://bugzilla.kernel.org/attachment.cgi?id=107331&action=diff) didn't solve it. The behaviour was, however, marginally "better": after wakeup the keyboard commands were at least echoed on the command line, but not executed. The system just hung at the same point.

----------

## toralf

But fwiw 3.11-rc7+ works fine - I'll switch to it from 3.10.x soon I think.

----------

## devsk

Wow! You did so much work and nobody responded anywhere. Shows the state of power management in Linux... :)

Anyway, I am running into the same issue in 3.12.4. I have configured pm-utils to use 'kernel' instead of s2disk. I am ok with it.

----------

