Kernel: BUG: sleeping function called from invalid context at mm/slab.h:567

pdestefa · May 28, 2020, 5:33am

I get the following error every time my monitor sleeps/wakes which is also when I lock/unlock the system.

May 25 21:49:22 <hostname> kernel: amdgpu 0000:08:00.0: RAS: optional ras ta ucode is not available
May 25 21:49:22 <hostname> kernel: BUG: sleeping function called from invalid context at mm/slab.h:567
May 25 21:49:22 <hostname> kernel: Call Trace:
May 25 21:49:22 <hostname> kernel:  dump_stack+0x8b/0xc8
May 25 21:49:22 <hostname> kernel:  ___might_sleep.cold+0xb6/0xc6
May 25 21:49:22 <hostname> kernel:  kmem_cache_alloc_trace+0x1ea/0x230
May 25 21:49:22 <hostname> kernel:  ? dcn20_clock_source_create+0x34/0x90 [amdgpu]
May 25 21:49:22 <hostname> kernel:  dcn20_clock_source_create+0x34/0x90 [amdgpu]
May 25 21:49:22 <hostname> kernel:  dcn20_resource_construct+0x175/0xae0 [amdgpu]
May 25 21:49:22 <hostname> kernel:  ? rcu_read_lock_sched_held+0x57/0x90
May 25 21:49:22 <hostname> kernel:  ? trace_kmalloc+0xf2/0x120
May 25 21:49:22 <hostname> kernel:  ? kmem_cache_alloc_trace+0x11a/0x230
May 25 21:49:22 <hostname> kernel:  ? dcn20_create_resource_pool+0x25/0x60 [amdgpu]
May 25 21:49:22 <hostname> kernel:  dcn20_create_resource_pool+0x3c/0x60 [amdgpu]
May 25 21:49:22 <hostname> kernel:  dc_create_resource_pool+0x14b/0x150 [amdgpu]
May 25 21:49:22 <hostname> kernel:  dc_create+0x1ef/0x6e0 [amdgpu]
May 25 21:49:22 <hostname> kernel:  ? kmem_cache_alloc_trace+0x11a/0x230
May 25 21:49:22 <hostname> kernel:  amdgpu_dm_init.isra.0+0x17c/0x1e0 [amdgpu]
May 25 21:49:22 <hostname> kernel:  ? lockdep_hardirqs_on+0x11e/0x1b0
May 25 21:49:22 <hostname> kernel:  dm_hw_init+0xe/0x20 [amdgpu]
May 25 21:49:22 <hostname> kernel:  amdgpu_device_init.cold+0x165f/0x1a7d [amdgpu]
May 25 21:49:22 <hostname> kernel:  amdgpu_driver_load_kms+0x5c/0x200 [amdgpu]
May 25 21:49:22 <hostname> kernel:  amdgpu_pci_probe+0xf4/0x180 [amdgpu]
May 25 21:49:22 <hostname> kernel:  local_pci_probe+0x42/0x80
May 25 21:49:22 <hostname> kernel:  pci_device_probe+0xd9/0x190
May 25 21:49:22 <hostname> kernel:  really_probe+0x167/0x410
May 25 21:49:22 <hostname> kernel:  driver_probe_device+0xb6/0x100
May 25 21:49:22 <hostname> kernel:  device_driver_attach+0xa8/0xb0
May 25 21:49:22 <hostname> kernel:  __driver_attach+0x8c/0x150
May 25 21:49:22 <hostname> kernel:  ? device_driver_attach+0xb0/0xb0
May 25 21:49:22 <hostname> kernel:  ? device_driver_attach+0xb0/0xb0
May 25 21:49:22 <hostname> kernel:  bus_for_each_dev+0x67/0x90
May 25 21:49:22 <hostname> kernel:  bus_add_driver+0x12e/0x1f0
May 25 21:49:22 <hostname> kernel:  driver_register+0x8b/0xe0
May 25 21:49:22 <hostname> kernel:  ? 0xffffffffc0d49000
May 25 21:49:22 <hostname> kernel:  do_one_initcall+0x69/0x350
May 25 21:49:22 <hostname> kernel:  ? kmem_cache_alloc_trace+0x11a/0x230
May 25 21:49:22 <hostname> kernel:  ? do_init_module+0x23/0x260
May 25 21:49:22 <hostname> kernel:  do_init_module+0x5c/0x260
May 25 21:49:22 <hostname> kernel:  __do_sys_init_module+0x162/0x190
May 25 21:49:22 <hostname> kernel:  do_syscall_64+0x5c/0xa0
May 25 21:49:22 <hostname> kernel:  entry_SYSCALL_64_after_hwframe+0x49/0xb3
May 25 21:49:22 <hostname> kernel: RIP: 0033:0x7f87385fb40e
May 25 21:49:22 <hostname> kernel: Code: 48 8b 0d 8d 0a 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5a 0a 0c 00 f7 d8 64 89 01 48
May 25 21:49:22 <hostname> kernel: RSP: 002b:00007ffdf88b1948 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
May 25 21:49:22 <hostname> kernel: RAX: ffffffffffffffda RBX: 00005573b33c7db0 RCX: 00007f87385fb40e
May 25 21:49:22 <hostname> kernel: RDX: 00007f873825595d RSI: 00000000009f5766 RDI: 00005573b3cd4980
May 25 21:49:22 <hostname> kernel: RBP: 00005573b3cd4980 R08: 00005573b33effa0 R09: 00007ffdf88b0f56
May 25 21:49:22 <hostname> kernel: R10: 0000000000000007 R11: 0000000000000246 R12: 0000000000000000
May 25 21:49:22 <hostname> kernel: R13: 00007f873825595d R14: 00005573b33a7fd0 R15: 00005573b33cc8b0
May 25 21:49:22 <hostname> kernel: [drm] Display Core initialized with v3.2.76!
May 25 21:49:22 <hostname> kernel: BUG: key ffff92d560a99148 has not been registered!
May 25 21:49:22 <hostname> kernel: ------------[ cut here ]------------
May 25 21:49:22 <hostname> kernel: fbcon: Taking over console
May 25 21:49:22 <hostname> kernel: DEBUG_LOCKS_WARN_ON(1)
May 25 21:49:22 <hostname> kernel: WARNING: CPU: 13 PID: 646 at kernel/locking/lockdep.c:4141 lockdep_init_map_waits+0x182/0x210
May 25 21:49:22 <hostname> kernel: Modules linked in: fjes(-) amdgpu(+) raid10 crct10dif_pclmul crc32_pclmul crc32c_intel amd_iommu_v2 gpu_sched ttm ghash_clmulni_intel drm_kms_helper cec drm igb ccp dca i2c_algo_bit uas usb_storage wmi pinctrl_amd br_netfilter bridge stp llc fuse
May 25 21:49:22 <hostname> kernel: CPU: 13 PID: 646 Comm: systemd-udevd Tainted: G        W        --------- ---  5.7.0-0.rc6.20200522git051143e1602d.1.fc32.x86_64 #1
May 25 21:49:22 <hostname> kernel: Hardware name: Gigabyte Technology Co., Ltd. B450 AORUS PRO WIFI/B450 AORUS PRO WIFI-CF, BIOS F50 11/27/2019
May 25 21:49:22 <hostname> kernel: RIP: 0010:lockdep_init_map_waits+0x182/0x210
May 25 21:49:22 <hostname> kernel: Code: 00 85 c0 0f 84 74 ff ff ff 8b 3d d9 54 bf 01 85 ff 0f 85 66 ff ff ff 48 c7 c6 59 27 3c b2 48 c7 c7 6d bb 36 b2 e8 63 78 f8 ff <0f> 0b e9 4c ff ff ff e8 a2 53 46 00 85 c0 74 21 44 8b 1d a7 54 bf
May 25 21:49:22 <hostname> kernel: RSP: 0018:ffffbd5940f57928 EFLAGS: 00010292
May 25 21:49:22 <hostname> kernel: RAX: 0000000000000016 RBX: 0000000000000000 RCX: 0000000000000000
May 25 21:49:22 <hostname> kernel: RDX: ffff92d56782b3c0 RSI: ffffffffb1170955 RDI: 0000000000000246
May 25 21:49:22 <hostname> kernel: RBP: ffff92d560914638 R08: 00000001381ddad7 R09: 0000000000000016
May 25 21:49:22 <hostname> kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff92d560a99148
May 25 21:49:22 <hostname> kernel: R13: 0000000000000000 R14: ffff92d560a9c600 R15: ffff92d5609a02e8
May 25 21:49:22 <hostname> kernel: FS:  00007f87374a9b80(0000) GS:ffff92d57ca00000(0000) knlGS:0000000000000000
May 25 21:49:22 <hostname> kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 25 21:49:22 <hostname> kernel: CR2: 00007ff701c13000 CR3: 00000007e7fbe000 CR4: 0000000000340ee0
May 25 21:49:22 <hostname> kernel: Call Trace:
May 25 21:49:22 <hostname> kernel:  __kernfs_create_file+0x7b/0x100
May 25 21:49:22 <hostname> kernel:  sysfs_add_file_mode_ns+0xa3/0x190
May 25 21:49:22 <hostname> kernel:  ? is_module_address+0x25/0x40
May 25 21:49:22 <hostname> kernel:  sysfs_create_bin_file+0x50/0x70
May 25 21:49:22 <hostname> kernel:  hdcp_create_workqueue+0x3b1/0x400 [amdgpu]
May 25 21:49:22 <hostname> kernel:  amdgpu_dm_init.isra.0.cold+0xa0/0x1026 [amdgpu]
May 25 21:49:22 <hostname> kernel:  ? lockdep_hardirqs_on+0x11e/0x1b0
May 25 21:49:22 <hostname> kernel:  ? hdcp_update_display+0x1f0/0x1f0 [amdgpu]
May 25 21:49:22 <hostname> kernel:  dm_hw_init+0xe/0x20 [amdgpu]
May 25 21:49:22 <hostname> kernel:  amdgpu_device_init.cold+0x165f/0x1a7d [amdgpu]
May 25 21:49:22 <hostname> kernel:  amdgpu_driver_load_kms+0x5c/0x200 [amdgpu]
May 25 21:49:22 <hostname> kernel:  amdgpu_pci_probe+0xf4/0x180 [amdgpu]
May 25 21:49:22 <hostname> kernel:  local_pci_probe+0x42/0x80
May 25 21:49:22 <hostname> kernel:  pci_device_probe+0xd9/0x190
May 25 21:49:22 <hostname> kernel:  really_probe+0x167/0x410
May 25 21:49:22 <hostname> kernel:  driver_probe_device+0xb6/0x100
May 25 21:49:22 <hostname> kernel:  device_driver_attach+0xa8/0xb0
May 25 21:49:22 <hostname> kernel:  __driver_attach+0x8c/0x150
May 25 21:49:22 <hostname> kernel:  ? device_driver_attach+0xb0/0xb0
May 25 21:49:22 <hostname> kernel:  ? device_driver_attach+0xb0/0xb0
May 25 21:49:22 <hostname> kernel:  bus_for_each_dev+0x67/0x90
May 25 21:49:22 <hostname> kernel:  bus_add_driver+0x12e/0x1f0
May 25 21:49:22 <hostname> kernel:  driver_register+0x8b/0xe0
May 25 21:49:22 <hostname> kernel:  ? 0xffffffffc0d49000
May 25 21:49:22 <hostname> kernel:  do_one_initcall+0x69/0x350
May 25 21:49:22 <hostname> kernel:  ? kmem_cache_alloc_trace+0x11a/0x230
May 25 21:49:22 <hostname> kernel:  ? do_init_module+0x23/0x260
May 25 21:49:22 <hostname> kernel:  do_init_module+0x5c/0x260
May 25 21:49:22 <hostname> kernel:  __do_sys_init_module+0x162/0x190
May 25 21:49:22 <hostname> kernel:  do_syscall_64+0x5c/0xa0
May 25 21:49:22 <hostname> kernel:  entry_SYSCALL_64_after_hwframe+0x49/0xb3
May 25 21:49:22 <hostname> kernel: RIP: 0033:0x7f87385fb40e
May 25 21:49:22 <hostname> kernel: Code: 48 8b 0d 8d 0a 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5a 0a 0c 00 f7 d8 64 89 01 48
May 25 21:49:22 <hostname> kernel: RSP: 002b:00007ffdf88b1948 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
May 25 21:49:22 <hostname> kernel: RAX: ffffffffffffffda RBX: 00005573b33c7db0 RCX: 00007f87385fb40e
May 25 21:49:22 <hostname> kernel: RDX: 00007f873825595d RSI: 00000000009f5766 RDI: 00005573b3cd4980
May 25 21:49:22 <hostname> kernel: RBP: 00005573b3cd4980 R08: 00005573b33effa0 R09: 00007ffdf88b0f56
May 25 21:49:22 <hostname> kernel: R10: 0000000000000007 R11: 0000000000000246 R12: 0000000000000000
May 25 21:49:22 <hostname> kernel: R13: 00007f873825595d R14: 00005573b33a7fd0 R15: 00005573b33cc8b0
May 25 21:49:22 <hostname> kernel: irq event stamp: 213599
May 25 21:49:22 <hostname> kernel: hardirqs last  enabled at (213599): [<ffffffffb116f0bf>] console_unlock+0x4af/0x6c0
May 25 21:49:22 <hostname> kernel: hardirqs last disabled at (213598): [<ffffffffb116ecbd>] console_unlock+0xad/0x6c0
May 25 21:49:22 <hostname> kernel: softirqs last  enabled at (213522): [<ffffffffb1e0037d>] __do_softirq+0x37d/0x4a8
May 25 21:49:22 <hostname> kernel: softirqs last disabled at (213515): [<ffffffffb10ec245>] irq_exit+0xe5/0x130
May 25 21:49:22 <hostname> kernel: ---[ end trace 81a4608bfead876c ]---

Does anyone recognize this?
What are the consequences of this? When a kernel driver has a crash, like this, what happens? Is it possible for this to cause other problems, later, even if the system appears to recover immediately after this?

strikerttd · May 28, 2020, 6:34am


So, I’d like to have some more information from your system which might help in identifying whether this is a known bug or not:

NOTE: The first command below will reveal your machine’s hostname as well as some other information you may wish to scrub before linking it here.

$ fpaste --sysinfo
$ rpm -qa | fpaste

IMO, any amdgpu bugs are squashed quite quickly. The built-in AMDGPU drivers are constantly monitored by developers and you shouldn’t see issues like this if you are (1) using the amdgpu driver that comes built-in to the Kernel and (2) are not using a nightly or rolling version of Fedora. But, it definitely looks like the amdgpu driver is dumping due to how the memory allocation is handled (a wild guess).

pdestefa · May 28, 2020, 7:59am

Hey Striker! Thanks for the help! I recognize you from FF package releases on bodhi.

Thanks for the heads up; boy, my rpm history reveals more than I expected.

https://paste.centos.org/view/601b5cf3
https://paste.centos.org/view/70ea5448

So, here is the real issue: for the past two months (since F31/kernel 5.4) my machine has been grinding to a halt 3-4 days after a reboot. During my troubleshooting, I have upgraded to F32, used kernels from testing, rawhide, and upstream via copr, which you will see in the pastebins, and replaced and increased my RAM.

Here’s the symptom: when I come back to the system to wake the monitor, I can tell BOINC isn’t running because the fans are quiet. But, here’s the thing, it’s not always unresponsive and it’s never totally hung. Some of it is still there, running like normal.

If the problem hasn’t started much earlier, I can log in via SSH and the mouse is responsive at the lockscreen. But sometimes SSH will authenticate but not be able to spawn a shell. If I do get in, it looks okay, but some processes are no longer running and there are no errors to explain why. I can type my password at the lockscreen, but it cannot unlock. While I’m poking around, it gets worse. When I tap the power button, systemd can sometimes initiate a shutdown. Also, sometimes I can get to a VC, but getty is not there and the kernel tells me that SysRq is disabled. But, ultimately, I have to hard reset.

The worst part is: there are no errors, other than this one I’m reporting. But, I cannot figure out if or how it could be related to this other issue. If it was OOM, wouldn’t the kernel report that? Wouldn’t the kernel report any reason a process failed unexpectedly? It’s like something is preventing running processes from getting scheduled and new processes from being created, but the kernel doesn’t report it. What could act like this but not trip any kernel resource protections?

It’s as if the system is out of some internal resource that doesn’t affect what I can see with top or vmstat. Like it ran out of PIDs or something. That’s some old school stuff I have seen and this sort of reminds me of that.
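The "ran out of PIDs" hunch can be checked directly while the machine is still responsive. The following is only a minimal sketch, assuming a Linux /proc filesystem; the helper names (pid_usage_percent, count_tasks, report_pid_usage) are hypothetical and not part of any existing tool. It compares the number of live tasks (threads) with the limit in /proc/sys/kernel/pid_max. Other limits, such as per-service task limits, may apply as well and are not covered here.

```shell
# Sketch only: helper names are hypothetical, not an existing tool.

# Print used/max as an integer percentage.
pid_usage_percent() {
    local used=$1 max=$2
    echo $(( used * 100 / max ))
}

# Count numeric entries under /proc/<pid>/task, i.e. one per thread.
count_tasks() {
    local n=0 d
    for d in /proc/[0-9]*/task/[0-9]*; do
        [ -e "$d" ] && n=$((n + 1))
    done
    echo "$n"
}

# Compare the live task count with the kernel-wide PID limit.
report_pid_usage() {
    local max used
    max=$(cat /proc/sys/kernel/pid_max)
    used=$(count_tasks)
    echo "tasks: $used / pid_max: $max ($(pid_usage_percent "$used" "$max")%)"
}
```

Running report_pid_usage periodically (for example from an SSH session left open) and watching whether the percentage climbs over the 3-4 days would either support or rule out this particular guess.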

Anyway, if you have any ideas, I’d really appreciate the help. Let me know if you think it’s related to this kernel bug, if you can; that’d be helpful.

strikerttd · May 28, 2020, 8:07am

Taking into account everything you’ve mentioned, I would suggest running MemTest86 against your RAM to see if it comes up with any issues. If I didn’t know any better, I’d say that it sounds like you have a RAM controller that might be going bad (not the module itself). If that’s the case, you’ll see errors even with new modules installed.

It might be worth your while to install F32 Stable and mess around with that for a while to see if you encounter the same types of issues.

From a diagnostics standpoint, I’d take into account any similarities that the crashing or disappearing services might have in common. In this case, it would either be an experimental Kernel with some funky memory allocation issues or the hardware itself.

EDIT

I’d also take a look at journalctl to see if anything stands out. If you don’t mind me also seeing it, write the output to /tmp and upload it somewhere:

$ sudo journalctl -b > /tmp/journalctl.out
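Since the full journal can be long, it may help to first pull out only the lines that open a kernel report like the one at the top of this thread. This is a rough sketch, assuming the default journal line format shown in the log above ("... kernel: BUG: ..."); kernel_bug_lines is a hypothetical helper name.

```shell
# kernel_bug_lines: hypothetical helper. Reads journal text on stdin and keeps
# only kernel lines that begin a "BUG:" or "WARNING:" report.
kernel_bug_lines() {
    grep -E 'kernel: (BUG|WARNING):'
}
```

$ sudo journalctl -k -b | kernel_bug_lines

Here -k limits the output to kernel messages and -b to the current boot. After a hard reset, -b -1 selects the previous boot instead, provided the journal is stored persistently.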

pdestefa · May 28, 2020, 5:10pm

Okay, that puts my system out of commission for a while, but I’ll try to make some time for it.

If it is the controller, would memtester work as well as MemTest86? I ran that for a bit, but didn’t let it go through a full cycle.

In any case, you don’t see any obvious connection to this kernel bug?

pdestefa · May 28, 2020, 5:15pm

BTW, kernel 5.7 rc5.1 seems to have improved the situation with the kernel BUG: that I originally reported. Since last boot, I haven’t had it recur but I’ve locked and unlocked my screen twice.

As you say, and as I have heard from many sources, amdgpu is receiving the most attention upstream. It’s a catch-22; I’ve experienced many amdgpu problems and had to resort to newer kernels, but then it’s just another unknown when reporting to fedora.

pauld · May 29, 2020, 9:53pm

Maybe I have this wrong… I guess RAS detects the error… and tries to reset the GPU… but because of the RAS error it cannot? So RAS might not be the real issue here. I guess it just means that the option to disable RAS is not likely to help much.

Could you give the pci id of your video card?

Googling a bit about RAS:

"RAS – Reliability, Availability, Serviceability – for supported hardware that at least for now appears to be focused on Vega 20 – likely just the Radeon Instinct products and not Radeon VII. The AMDGPU RAS support includes SRAM/VRAM ECC, bad page tracking, and error containment."

It is apparently possible to deactivate RAS totally, or partially.
From drm/amdgpu AMDgpu driver — The Linux Kernel documentation

ras_enable (int)
Enable RAS features on the GPU (0 = disable, 1 = enable, -1 = auto (default))

ras_mask (uint)
Mask of RAS features to enable (default 0xffffffff), only valid when ras_enable == 1. See the flags in drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h

I am quite unsure, but you could try the kernel parameter amdgpu.ras_enable=0
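For anyone trying this, here is a small sketch of how one might confirm that the parameter actually reached the kernel after a reboot. has_kernel_arg is a hypothetical helper, not an existing command; it simply checks whether an exact argument appears as a word in a command line string.

```shell
# has_kernel_arg: hypothetical helper. Succeeds (exit 0) if the exact argument
# appears as a whole word in the given kernel command line string.
has_kernel_arg() {
    local cmdline=$1 arg=$2 word
    # Unquoted on purpose: split the command line on whitespace.
    for word in $cmdline; do
        [ "$word" = "$arg" ] && return 0
    done
    return 1
}
```

On Fedora the argument can usually be added with grubby, and then checked after rebooting:

$ sudo grubby --update-kernel=ALL --args="amdgpu.ras_enable=0"
$ has_kernel_arg "$(cat /proc/cmdline)" amdgpu.ras_enable=0 && echo "ras_enable=0 is set"

If the amdgpu module exposes the parameter under /sys/module/amdgpu/parameters/ras_enable (an assumption worth verifying on the running kernel), reading that file should show the value in effect.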

pdestefa · May 30, 2020, 1:18am

Hey pauld, I thought that was one of my handles, actually. Small world.

I don’t know what RAS has to do with anything; can’t even tell what it does and not sure my card has that capability. What do you mean by “pci id”? That usually means the vendor code that PCI cards return when queried, but lspci de-references that for you, usually. I’ll just give you both:

08:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] [1002:731f] (rev c1)
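A line of this shape, with names plus bracketed numeric IDs, is what lspci -nn prints. As a small sketch, assuming that format, the vendor:device pair alone can be pulled out like this; pci_ids is a hypothetical helper name.

```shell
# pci_ids: hypothetical helper. Reads `lspci -nn` style lines on stdin and
# prints only the bracketed [vendor:device] pairs (four hex digits each).
pci_ids() {
    grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]'
}
```

$ lspci -nn -s 08:00.0 | pci_ids

The class code (here [0300]) and the vendor name in brackets are skipped because they do not match the xxxx:xxxx pattern.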

It doesn’t seem like RAS is causing any of the problem, does it? Sounds more like RAS just isn’t doing anything. In any case, I’d like to understand this error better, but it seems to be gone. That’s what it’s like with AMD: you can fight something for months or longer and then it’s just gone one day with a new kernel and you have no idea what other problems were related to it.

pdestefa · June 2, 2020, 4:40pm

So, I thought this had got fixed, but apparently not. I have a different kernel, now, and it’s back! Dang it.
