Laptop doesn't resume from sleep after kernel update from 6.2.x to 6.3.x (F37)

I’m running F37 on a Lenovo ThinkPad P1 Gen 4i. After taking the update to kernel 6.3.x through the regular update process, I started to see random resume failures.

Background: This is my main computer, and I use it for many hours daily. Usually, I suspend it in the evening and resume it the next morning. On rare occasions I leave it on overnight. I reboot it only to start a new kernel, so it typically runs for 3-4 weeks without a reboot.

Symptoms:

  • I open the lid, and the power LED keeps blinking (normally, it goes steady-on after 1-2 seconds). The dock power LED, however, goes steady-on as expected.
  • Nothing seems to wake up (fans, backlight, external monitor).
  • Pressing keys or short pressing the power button has no effect.
  • The only way to get out of this state seems to be a complete power-off (by long-pressing the power button).

Additional info:

  • I have had this computer for ~4 months, and haven’t seen a single suspend/resume issue on kernel <= 6.2.x. It had been working flawlessly until I updated to 6.3.x.
  • I can’t reproduce the problem (on kernel 6.3.x) immediately after a fresh boot. I did ~10 suspend/resume cycles, with and without the docking station connected, and it worked as expected.
  • Even on kernel 6.3.x, it worked once with my usual suspend in the evening/resume in the morning routine, but it failed the next day.
  • I saw the problem only 3 times total, then I went back to kernel 6.2.x. Since this is my main laptop, it’s very inconvenient to lose state like that over suspend/resume.
  • Suspend mode is configured to “Linux S3” in BIOS.
  • The BIOS is relatively new (updated ~4 months ago when I got the laptop).

Since the resume part happens entirely in firmware/BIOS (at least the early stages of it), I’m thinking whatever the problem is, it actually occurs before the suspend. Unfortunately, since I haven’t found a way to make it wake up, I can’t see the logs before the suspend.

Unfortunately, ACPI-related issues seem to be a common phenomenon on 6.3.x.

I suggest providing some logs.

Provoke the problem and, once nothing responds any longer, do a hard reset as usual. Then, immediately after the next boot, capture the output of sudo journalctl -k --no-hostname --boot=-1 and share it. Here, -k selects kernel messages, --no-hostname omits the hostname field, and --boot=-1 selects the previous boot (that is, the boot before the one in which you enter the command, which will be the broken one).

Feel free to anonymize content that you consider private (e.g., MAC or IP addresses).
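
As a rough illustration of that kind of anonymization, here is a minimal Python sketch. The function name anonymize and the placeholder strings are made up for this example. It only covers colon-separated MAC addresses and dotted IPv4 addresses, so IPv6 addresses, hostnames, and anything else private would still need a manual review.

```python
import re

# Six colon-separated hex octets, e.g. 00:1a:2b:3c:4d:5e.
MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")
# Dotted-quad IPv4 addresses (the octet range is deliberately not checked).
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def anonymize(text):
    """Replace MAC and IPv4 addresses in text with fixed placeholders."""
    text = MAC_RE.sub("xx:xx:xx:xx:xx:xx", text)
    return IPV4_RE.sub("x.x.x.x", text)
```

Timestamps such as 21:11:09 have only three colon-separated groups, so the MAC pattern leaves them alone.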

Please use an external service or a link to a file to provide the information. Do not paste it here.

Thanks! This one may not be ACPI related. I was looking at other similar posts and got inspired by this reply, which suggests that these problems may not be directly related to suspend/resume but instead earlier problems that manifest during suspend/resume. That’s consistent with the behavior I see, where the problem can’t be reproduced on a fresh boot.

If the real problem appears before suspend/resume, it means I might have something in the logs. This is also what you suggested. So, I decided to take a closer look at the kernel logs and found this little rat:

Jun 01 21:11:09 thinkpad-p1.localdomain kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: #PF: supervisor read access in kernel mode
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: #PF: error_code(0x0000) - not-present page
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: PGD 0 P4D 0 
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: CPU: 6 PID: 35138 Comm: kworker/u32:52 Not tainted 6.3.4-101.fc37.x86_64 #1
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: Hardware name: LENOVO 20Y4S1QE15/20Y4S1QE15, BIOS N40ET40W (1.22 ) 02/21/2023
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: Workqueue: USBC000:00-con1 ucsi_poll_worker [typec_ucsi]
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: RIP: 0010:ucsi_acpi_async_write+0x30/0x50 [ucsi_acpi]
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: Code: 44 00 00 41 55 49 89 cd 41 54 49 89 d4 55 53 89 f3 e8 d4 7d 09 00 4c 89 e6 89 df 4c 89 ea 48 03 78 10 48 89 c5 e8 a0 2a bf eb <49> 8b 04 24 48 89 ef be 01 00 00 00 48 89 45 50 5b 5d 41 5c 41 5d
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: RSP: 0018:ffffb64d423efd60 EFLAGS: 00010282
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: RAX: ffffb64d40019002 RBX: 0000000000000002 RCX: 0000000000000000
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb64d40019002
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: RBP: ffff938941d06328 R08: ffff938b764d2738 R09: ffff938b764d2720
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: R10: 000000000000000f R11: ffffb64d423efb60 R12: 0000000000000000
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: R13: 0000000000000000 R14: ffff938941d004b8 R15: ffff93894f5a8f08
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: FS:  0000000000000000(0000) GS:ffff93985f580000(0000) knlGS:0000000000000000
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: CR2: 0000000000000000 CR3: 00000003c7022003 CR4: 0000000000f70ee0
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: PKRU: 55555554
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: Call Trace:
Jun 01 21:11:09 thinkpad-p1.localdomain kernel:  <TASK>
Jun 01 21:11:09 thinkpad-p1.localdomain kernel:  ucsi_exec_command+0x24b/0x2d0 [typec_ucsi]
Jun 01 21:11:09 thinkpad-p1.localdomain kernel:  ucsi_send_command+0x4b/0xe0 [typec_ucsi]
Jun 01 21:11:09 thinkpad-p1.localdomain kernel:  ucsi_register_altmodes+0xd5/0x1c0 [typec_ucsi]
Jun 01 21:11:09 thinkpad-p1.localdomain kernel:  ucsi_check_altmodes+0x1b/0xa0 [typec_ucsi]
Jun 01 21:11:09 thinkpad-p1.localdomain kernel:  ? mutex_lock+0x12/0x30
Jun 01 21:11:09 thinkpad-p1.localdomain kernel:  ucsi_poll_worker+0x3a/0x110 [typec_ucsi]
Jun 01 21:11:09 thinkpad-p1.localdomain kernel:  process_one_work+0x1c5/0x3c0
Jun 01 21:11:09 thinkpad-p1.localdomain kernel:  worker_thread+0x51/0x390
Jun 01 21:11:09 thinkpad-p1.localdomain kernel:  ? __pfx_worker_thread+0x10/0x10
Jun 01 21:11:09 thinkpad-p1.localdomain kernel:  kthread+0xdb/0x110
Jun 01 21:11:09 thinkpad-p1.localdomain kernel:  ? __pfx_kthread+0x10/0x10
Jun 01 21:11:09 thinkpad-p1.localdomain kernel:  ret_from_fork+0x29/0x50
Jun 01 21:11:09 thinkpad-p1.localdomain kernel:  </TASK>
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: Modules linked in: tls r8153_ecm cdc_ether usbnet r8152 mii rfcomm snd_seq_dummy snd_hrtimer sunrpc bridge stp llc qrtr nft_chain_nat xt_MASQUERADE nf_nat xt_multiport bnep xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag>
Jun 01 21:11:09 thinkpad-p1.localdomain kernel:  snd_hwdep videobuf2_vmalloc iwlwifi mei_wdt videobuf2_memops mei_pxp btusb irqbypass snd_seq videobuf2_v4l2 btrtl iTCO_wdt videobuf2_common btbcm snd_seq_device intel_pmc_bxt rapl ee1004 iTCO_vendor_support btintel intel_c>
Jun 01 21:11:09 thinkpad-p1.localdomain kernel:  rtsx_pci sha512_ssse3 typec_ucsi serio_raw ttm nvme_common typec video i2c_hid_acpi i2c_hid wmi pinctrl_tigerlake scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables dm_multipath i2c_dev fuse
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: CR2: 0000000000000000
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: ---[ end trace 0000000000000000 ]---
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: RIP: 0010:ucsi_acpi_async_write+0x30/0x50 [ucsi_acpi]
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: Code: 44 00 00 41 55 49 89 cd 41 54 49 89 d4 55 53 89 f3 e8 d4 7d 09 00 4c 89 e6 89 df 4c 89 ea 48 03 78 10 48 89 c5 e8 a0 2a bf eb <49> 8b 04 24 48 89 ef be 01 00 00 00 48 89 45 50 5b 5d 41 5c 41 5d
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: RSP: 0018:ffffb64d423efd60 EFLAGS: 00010282
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: RAX: ffffb64d40019002 RBX: 0000000000000002 RCX: 0000000000000000
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb64d40019002
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: RBP: ffff938941d06328 R08: ffff938b764d2738 R09: ffff938b764d2720
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: R10: 000000000000000f R11: ffffb64d423efb60 R12: 0000000000000000
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: R13: 0000000000000000 R14: ffff938941d004b8 R15: ffff93894f5a8f08
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: FS:  0000000000000000(0000) GS:ffff93985f580000(0000) knlGS:0000000000000000
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: CR2: 0000000000000000 CR3: 00000003c7022003 CR4: 0000000000f70ee0
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: PKRU: 55555554

Furthermore, it seems to be consistent with the suspend/resume failures. Suspend/resume always works before I hit that bug and always fails after. Also, the bug never occurred on kernel 6.2.

This is the relevant log snippet showing the correlation:

Jun 01 14:44:14 thinkpad-p1.localdomain kernel: Linux version 6.3.4-101.fc37.x86_64 (mockbuild@bkernel02.iad2.fedoraproject.org) (gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1), GNU ld version 2.38-27.fc37) #1 SMP PREEMPT_DYNAMIC Sat May 27 15:09:40 UTC 2023
Jun 01 18:11:18 thinkpad-p1.localdomain kernel: PM: suspend entry (deep)
Jun 01 21:11:06 thinkpad-p1.localdomain kernel: ACPI: PM: Waking up from system sleep state S3
Jun 01 21:11:09 thinkpad-p1.localdomain kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Jun 02 00:18:24 thinkpad-p1.localdomain kernel: PM: suspend entry (deep)
Jun 02 00:21:35 thinkpad-p1.localdomain kernel: Linux version 6.3.4-101.fc37.x86_64 (mockbuild@bkernel02.iad2.fedoraproject.org) (gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1), GNU ld version 2.38-27.fc37) #1 SMP PREEMPT_DYNAMIC Sat May 27 15:09:40 UTC 2023
Jun 02 00:21:41 thinkpad-p1.localdomain kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Jun 02 08:03:20 thinkpad-p1.localdomain kernel: Linux version 6.3.4-101.fc37.x86_64 (mockbuild@bkernel02.iad2.fedoraproject.org) (gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1), GNU ld version 2.38-27.fc37) #1 SMP PREEMPT_DYNAMIC Sat May 27 15:09:40 UTC 2023
Jun 02 08:11:59 thinkpad-p1.localdomain kernel: PM: suspend entry (deep)
Jun 02 08:12:07 thinkpad-p1.localdomain kernel: ACPI: PM: Waking up from system sleep state S3
Jun 02 08:12:40 thinkpad-p1.localdomain kernel: PM: suspend entry (deep)
Jun 02 08:12:58 thinkpad-p1.localdomain kernel: ACPI: PM: Waking up from system sleep state S3
Jun 02 22:35:07 thinkpad-p1.localdomain kernel: PM: suspend entry (deep)
Jun 03 08:54:33 thinkpad-p1.localdomain kernel: ACPI: PM: Waking up from system sleep state S3
Jun 04 22:45:23 thinkpad-p1.localdomain kernel: PM: suspend entry (deep)
Jun 05 08:22:25 thinkpad-p1.localdomain kernel: ACPI: PM: Waking up from system sleep state S3
Jun 05 08:22:25 thinkpad-p1.localdomain kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Jun 05 23:15:44 thinkpad-p1.localdomain kernel: PM: suspend entry (deep)
Jun 06 09:22:15 thinkpad-p1.localdomain kernel: Linux version 6.2.15-200.fc37.x86_64 (mockbuild@bkernel02.iad2.fedoraproject.org) (gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4), GNU ld version 2.38-27.fc37) #1 SMP PREEMPT_DYNAMIC Thu May 11 15:56:33 UTC 2023

The Linux version messages indicate a cold boot (when one directly follows a suspend entry without a wake-up message in between, the resume has failed), the PM: suspend entry messages indicate when the system goes to sleep, and the ACPI: PM: Waking up messages indicate when the system resumes cleanly.
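
The same reading of the log can be sketched in Python. This is only an illustration of the correlation described above, not a diagnostic tool: the function names classify and failed_resumes are invented for this example, the marker strings are copied from the messages quoted in this thread, and the 15-character timestamp slice assumes journalctl's default short timestamp format.

```python
# Marker substrings taken from the kernel messages quoted in this thread.
MARKERS = [
    ("boot", "kernel: Linux version "),
    ("suspend", "PM: suspend entry"),
    ("resume", "PM: Waking up from system sleep state"),
    ("oops", "BUG: kernel NULL pointer dereference"),
]


def classify(line):
    """Return the event kind for a kernel log line, or None if irrelevant."""
    for kind, marker in MARKERS:
        if marker in line:
            return kind
    return None


def failed_resumes(lines):
    """Find suspends that were followed by a cold boot instead of a resume.

    Returns a list of (suspend_timestamp, oops_seen_earlier_in_that_boot).
    """
    results = []
    oops_seen = False
    suspended_at = None
    for line in lines:
        kind = classify(line)
        if kind == "boot":
            # A boot while a suspend is still pending means the resume failed.
            if suspended_at is not None:
                results.append((suspended_at, oops_seen))
            oops_seen = False
            suspended_at = None
        elif kind == "suspend":
            suspended_at = line[:15]  # e.g. "Jun 01 18:11:18"
        elif kind == "resume":
            suspended_at = None
        elif kind == "oops":
            oops_seen = True
    return results
```

Fed the log snippet above line by line, this reports two failed resumes (the suspends at Jun 02 00:18:24 and Jun 05 23:15:44), both with an oops logged earlier in the same boot.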

I’ll search lore.kernel.org to see if anyone has reported this bug already (and report it if nobody has).

Please don’t (also, lore is a mailing list archive, not the kernel’s bug tracker). If anywhere, it belongs in our Bugzilla. However, you are already presuming a lot that we don’t know, and you selected the log extracts based on those presumptions, so we cannot go further with them. There is clearly an issue in the kernel, but you should not take these entries (including the BUG entry) too literally as to when, where, and what the issue is. It is correct that an issue can originate much earlier: indicators may already appear at boot, and an error in the kernel can sometimes produce symptoms much later. That is one of the reasons why this small extract is not sufficiently indicative. At the moment, I would still focus on something ACPI-based (although we don’t know that for sure either!).

Off the cuff, I expect this will end up as a bug report in Bugzilla. That would need the whole log, but before that, I think there is something we can already try:

There is a problematic bug in ucsi_acpi, which caused a lot of trouble in another topic and in some bug reports. This issue was solved in 6.3.8. Let’s check if that is related:

Can you please boot 6.3.7 with the parameter module_blacklist=ucsi_acpi? Does that make a difference? (With this in mind, please also update to at least 6.3.7 or, if you want to test whether your issue has already been solved, to 6.3.8, which is currently in testing.) You could increase the number of kernels that are kept to ensure that 6.2.15 does not get removed.

Also, please provide the requested logs.

Currently running 6.3.7 with module_blacklist=ucsi_acpi. Since I don’t have a way to directly reproduce that bug, I’ll just see how it plays out over the next few days.
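
For anyone who wants to double-check that the parameter actually reached the running kernel, here is a small Python sketch. It assumes the kernel command line is available as a string (on Linux it can be read from /proc/cmdline), and the function name blacklisted_modules is invented for this example.

```python
def blacklisted_modules(cmdline):
    """Collect module names from every module_blacklist= kernel parameter."""
    modules = set()
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "module_blacklist" and sep:
            # The parameter takes a comma-separated list of module names.
            modules.update(name for name in value.split(",") if name)
    return modules
```

On a running system, something like "ucsi_acpi" in blacklisted_modules(open("/proc/cmdline").read()) should then be true if the parameter was picked up.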

In the meantime, I found this:
https://lore.kernel.org/all/20230606115802.79339-1-heikki.krogerus@linux.intel.com/
which leads to:
https://bugzilla.kernel.org/show_bug.cgi?id=217517

The stack trace and code dump look similar (but not identical), so it may (or may not) be the same bug. For now, the patch has made it into linux-next, so it will take a while before it lands in Linus’s tree and gets backported to linux-6.3.y.

I added a comment to that kernel.org BZ (and no, I did not ignore your advice, you just hadn’t posted your reply yet at that time). We’ll see if I get any feedback there.

Thanks again for all the suggestions!

6.3.8 was released, and so far the feedback from users whose issue could be mitigated by module_blacklist=ucsi_acpi confirms that 6.3.8 solves the issue permanently, so they no longer need module_blacklist=ucsi_acpi.
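
As a rough sketch, a release string such as the output of uname -r can be compared against that version in Python. The 6.3.8 threshold is simply the version reported in this thread for Fedora kernels; whether it applies to other distributions or kernel series is an assumption. The names FIXED_IN, kernel_version, and has_ucsi_fix are made up for this example.

```python
import re

# Version reported in this thread as containing the fix on Fedora; treat it
# as an assumption for any other distribution or kernel series.
FIXED_IN = (6, 3, 8)


def kernel_version(release):
    """Parse the leading major.minor.patch of a release string into a tuple."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def has_ucsi_fix(release):
    """Return True if the release is at or above the FIXED_IN version."""
    return kernel_version(release) >= FIXED_IN
```

For the currently running kernel, platform.release() from the standard library returns the same string as uname -r.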

Be aware that if we have sufficient logs, and our kernel people conclude from them that a patch that is already upstream will solve the issue, they can add that patch to our kernel in advance, provided some conditions are satisfied.

Thanks for the update. I upgraded to 6.3.8 yesterday, because blacklisting ucsi_acpi doesn’t work that well for me (it seems to have a weird side effect where the external monitor, which is connected to a USB-C dock, stops working after resume).

So far, no issues on 6.3.8 - suspend/resume has been working well, and I haven’t seen the null pointer deref bug. It’s been only one day, so hard to say if it’s really fixed. I will post an update in a few days.

Fedora kernel 6.3.8 does include the upstream fix that I mentioned before. See commit f2c15688, “Merge branch 'fedora-6.3-ucsi_acpi-boot-crash-fix' into 'fedora-6.3'”, in cki-project / kernel-ark on GitLab. It looks like it was backported early, before it made it into Linus’ tree and all the way back into the upstream stable series.

At this point, I am pretty confident that it is the same issue I was seeing, and that upgrading to 6.3.8 fixes it permanently. Thanks again for all the suggestions.