I’m running F37 on a Lenovo ThinkPad P1 Gen 4i. After taking the update to kernel 6.3.x through the regular update process, I started to see random resume failures.
Background: This is my main computer, and I use it for many hours daily. Usually, I suspend it in the evening and resume it the next morning. On rare occasions I leave it on overnight. I reboot it only to start a new kernel, so it typically runs for 3-4 weeks without a reboot.
Symptoms:
I open the lid, and the power LED keeps blinking (normally, it should go steady-on after 1-2 seconds). The dock power LED, however, goes steady-on as expected.
Nothing seems to wake up (fans, backlight, external monitor).
Pressing keys or short pressing the power button has no effect.
The only way to get out of this state seems to be complete power off (by long pressing the power button).
Additional info:
I have had this computer for ~4 months, and haven’t seen a single suspend/resume issue on kernel <= 6.2.x. It had been working flawlessly until I updated to 6.3.x.
I can’t reproduce the problem (on kernel 6.3.x) immediately after a fresh boot. I did ~10 suspend/resume cycles, with and without the docking station connected, and it worked as expected.
Even on kernel 6.3.x, it worked once with my usual suspend in the evening/resume in the morning routine, but it failed the next day.
I saw the problem only 3 times total, then I went back to kernel 6.2.x. Since this is my main laptop, it’s very inconvenient to lose state like that over suspend/resume.
Suspend mode is configured to “Linux S3” in BIOS.
The BIOS is relatively new (updated ~4 months ago when I got the laptop).
Since the resume part happens entirely in firmware/BIOS (at least the early stages of it), I’m thinking whatever the problem is, it actually occurs before the suspend. Unfortunately, since I haven’t found a way to make it wake up, I can’t see the logs before the suspend.
Unfortunately, ACPI-related issues seem to be a common phenomenon on 6.3.x.
I suggest providing some logs.
Provoke the problem and, once nothing works any longer, do a hard reset as usual. Then, immediately after the next boot, get the output of sudo journalctl -k --no-hostname --boot=-1 and share it with us. Here, -k limits the output to kernel messages, and --boot=-1 selects the previous boot (the one before the boot in which you enter the command, which will be the broken one).
Feel free to anonymize content that you consider private (e.g., MAC/IP addresses).
Please use an external service or a link to a file to provide the information. Do not paste it here.
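If it helps with the anonymizing step, here is a minimal Python sketch that masks MAC addresses and IPv4 addresses in a saved log before you upload it. The function name redact and the placeholder strings are my own choices, not part of any tool. It does not handle IPv6, and the simple IPv4 pattern would also mask any other dotted group of four numbers, so please check the result by eye.

```python
import re

# MAC addresses: six hex pairs separated by ':' or '-'.
MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b")
# Dotted-quad IPv4 addresses (no range check; this is only for masking).
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(line: str) -> str:
    """Return the line with MAC and IPv4 addresses replaced by placeholders."""
    line = MAC_RE.sub("xx:xx:xx:xx:xx:xx", line)
    line = IPV4_RE.sub("x.x.x.x", line)
    return line
```

You would save the journalctl output to a file first, run every line through redact, and upload the result.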
Thanks! This one may not be ACPI related. I was looking at other similar posts and got inspired by this reply, which suggests that these problems may not be directly related to suspend/resume but instead earlier problems that manifest during suspend/resume. That’s consistent with the behavior I see, where the problem can’t be reproduced on a fresh boot.
If the real problem appears before suspend/resume, it means I might have something in the logs. This is also what you suggested. So, I decided to take a closer look at the kernel logs and found this little rat:
Furthermore, it seems to be consistent with the suspend/resume failures. Suspend/resume always works before I hit that bug and always fails after. Also, the bug never occurred on kernel 6.2.
This is the relevant log snippet showing the correlation:
Jun 01 14:44:14 thinkpad-p1.localdomain kernel: Linux version 6.3.4-101.fc37.x86_64 (mockbuild@bkernel02.iad2.fedoraproject.org) (gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1), GNU ld version 2.38-27.fc37) #1 SMP PREEMPT_DYNAMIC Sat May 27 15:09:40 UTC 2023
Jun 01 18:11:18 thinkpad-p1.localdomain kernel: PM: suspend entry (deep)
Jun 01 21:11:06 thinkpad-p1.localdomain kernel: ACPI: PM: Waking up from system sleep state S3
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Jun 02 00:18:24 thinkpad-p1.localdomain kernel: PM: suspend entry (deep)
Jun 02 00:21:35 thinkpad-p1.localdomain kernel: Linux version 6.3.4-101.fc37.x86_64 (mockbuild@bkernel02.iad2.fedoraproject.org) (gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1), GNU ld version 2.38-27.fc37) #1 SMP PREEMPT_DYNAMIC Sat May 27 15:09:40 UTC 2023
Jun 02 00:21:41 thinkpad-p1.localdomain kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Jun 02 08:03:20 thinkpad-p1.localdomain kernel: Linux version 6.3.4-101.fc37.x86_64 (mockbuild@bkernel02.iad2.fedoraproject.org) (gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1), GNU ld version 2.38-27.fc37) #1 SMP PREEMPT_DYNAMIC Sat May 27 15:09:40 UTC 2023
Jun 02 08:11:59 thinkpad-p1.localdomain kernel: PM: suspend entry (deep)
Jun 02 08:12:07 thinkpad-p1.localdomain kernel: ACPI: PM: Waking up from system sleep state S3
Jun 02 08:12:40 thinkpad-p1.localdomain kernel: PM: suspend entry (deep)
Jun 02 08:12:58 thinkpad-p1.localdomain kernel: ACPI: PM: Waking up from system sleep state S3
Jun 02 22:35:07 thinkpad-p1.localdomain kernel: PM: suspend entry (deep)
Jun 03 08:54:33 thinkpad-p1.localdomain kernel: ACPI: PM: Waking up from system sleep state S3
Jun 04 22:45:23 thinkpad-p1.localdomain kernel: PM: suspend entry (deep)
Jun 05 08:22:25 thinkpad-p1.localdomain kernel: ACPI: PM: Waking up from system sleep state S3
Jun 05 08:22:25 thinkpad-p1.localdomain kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Jun 05 23:15:44 thinkpad-p1.localdomain kernel: PM: suspend entry (deep)
Jun 06 09:22:15 thinkpad-p1.localdomain kernel: Linux version 6.2.15-200.fc37.x86_64 (mockbuild@bkernel02.iad2.fedoraproject.org) (gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4), GNU ld version 2.38-27.fc37) #1 SMP PREEMPT_DYNAMIC Thu May 11 15:56:33 UTC 2023
The Linux version messages indicate a cold boot (suspend/resume has failed), the PM: suspend entry messages indicate when the system goes to sleep, and the ACPI: PM: Waking up messages indicate when the system resumes cleanly.
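For anyone who wants to apply the same reading to their own logs, this is roughly the logic as a small Python sketch. It is only an illustration of how I interpret the messages: the function names classify and failed_resumes are made up, and it assumes that a suspend entry followed by a boot message, with no wake-up message in between, means the resume failed.

```python
def classify(line: str) -> str:
    """Map a kernel log line to one of the event kinds discussed above."""
    if "kernel: Linux version" in line:
        return "boot"
    if "PM: suspend entry" in line:
        return "suspend"
    if "ACPI: PM: Waking up from system sleep state" in line:
        return "resume"
    if "BUG: kernel NULL pointer dereference" in line:
        return "bug"
    return "other"


def failed_resumes(lines):
    """Return (suspend_line, bug_seen) for each suspend that ended in a cold boot.

    bug_seen is True if a NULL pointer BUG was logged earlier in the same boot.
    """
    failures = []
    pending_suspend = None  # last suspend entry without a matching wake-up
    bug_seen = False        # has the BUG shown up since the last boot?
    for line in lines:
        kind = classify(line)
        if kind == "bug":
            bug_seen = True
        elif kind == "suspend":
            pending_suspend = line
        elif kind == "resume":
            pending_suspend = None
        elif kind == "boot":
            if pending_suspend is not None:
                failures.append((pending_suspend, bug_seen))
            pending_suspend = None
            bug_seen = False
    return failures
```

Feeding it the lines of a saved kernel log in chronological order gives the list of suspends that never resumed, together with whether the BUG had already appeared in that boot.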
I’ll search lore.kernel.org to see if anyone has reported this bug already (and report it if nobody has).
Please don’t (also, lore is a mailing list archive, not the kernel’s bug tracker). If it belongs anywhere, it belongs in our Bugzilla. However, you are already presuming a lot that we don’t know, and you selected the log extracts based on those presumptions, so we cannot go further with them. There is clearly an issue in the kernel, but you should not take these entries (including the BUG entry) too literally as to when, where, and what the issue is. It is correct that an issue can arise much earlier: indicators may already appear at boot, and an error in the kernel can sometimes create symptoms much later. That is one of the reasons why this small extract is not sufficiently indicative. At the moment, I would still focus on something ACPI-based (although we don’t know this for sure either).
Off the cuff, I expect this to end up as a bug report in Bugzilla. That would need the whole log, but before that, I think there is something we can already try:
There is a problematic bug in ucsi_acpi, which caused a lot of trouble in another topic and in some bug reports. This issue was solved in 6.3.8. Let’s check if that is related:
Can you please boot 6.3.7 with the parameter module_blacklist=ucsi_acpi? Does that make a difference? (With this in mind, please update to at least 6.3.7 or, if you want to test whether your issue has already been solved, to 6.3.8, which is currently in testing.) You could increase the number of kernels that are kept to ensure that 6.2.15 does not get removed.
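To double-check that the parameter actually reached the running kernel, you can look at /proc/cmdline after booting. A minimal Python sketch of that check follows; the helper names blacklisted_modules and is_blacklisted are hypothetical. It treats the value as a comma-separated list and merges repeated occurrences of the parameter, which is a simplification and may not match exactly how the kernel handles duplicates.

```python
def blacklisted_modules(cmdline: str) -> set:
    """Collect module names passed via module_blacklist= on a kernel command line."""
    modules = set()
    for token in cmdline.split():
        if token.startswith("module_blacklist="):
            value = token.split("=", 1)[1]
            # The value is a comma-separated list of module names.
            modules.update(name for name in value.split(",") if name)
    return modules


def is_blacklisted(module: str, cmdline: str) -> bool:
    """Return True if the module appears in module_blacklist= on the command line."""
    return module in blacklisted_modules(cmdline)
```

On the booted system, something like is_blacklisted("ucsi_acpi", open("/proc/cmdline").read()) should then return True if the parameter was applied.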
Currently running 6.3.7 with module_blacklist=ucsi_acpi. Since I don’t have a way to directly reproduce that bug, I’ll just see how it plays out over the next few days.
The stack trace and code dump look similar (but not identical), so it may (or may not) be the same bug. For now, the patch has made it into linux-next, so it will take a while before it lands in Linus’s tree and gets backported to linux-6.3.y.
I added a comment to that kernel.org BZ (and no, I did not ignore your advice, you just hadn’t posted your reply yet at that time). We’ll see if I get any feedback there.
6.3.8 has been released. So far, users whose issue could be mitigated by module_blacklist=ucsi_acpi have confirmed that 6.3.8 solves it permanently, so they no longer need module_blacklist=ucsi_acpi.
Be aware that if we have sufficient logs, and our kernel maintainers conclude from them that a patch that is already upstream will solve the issue, they can add that patch to our kernel in advance, provided some conditions are satisfied.
Thanks for the update. I upgraded to 6.3.8 yesterday, because blacklisting ucsi_acpi doesn’t work that well for me (it seems to have a weird side effect where the external monitor, which is connected to a USB-C dock, stops working after resume).
So far, no issues on 6.3.8 - suspend/resume has been working well, and I haven’t seen the null pointer deref bug. It’s been only one day, so hard to say if it’s really fixed. I will post an update in a few days.
At this point, I am pretty confident that it is the same issue I was seeing, and that upgrading to 6.3.8 fixes it permanently. Thanks again for all the suggestions.