CoreOS 41 CPU lockup on nfs activity

Hi everyone,

I am reaching out today to seek for help about an issue I have been observing recently on my CoreOS 41 virtual machine.

This VM has been running for the past 4 years with no issues, it looks like kernel 6.11.5 / CoreOS 41 may have doomed it.

The VM is running on a Truenas Scale host (Dragonfish-24.04.2.5) which is based on QEmu has 2 CPUs with 2 Cores and 2 Threads, 16 Gb memory, a UEFI bootloader, 200 Gb disk and one NIC.

The VM is running Fedora CoreOS version 41.20241027.3.0
Linux coreos 6.11.5-300.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Oct 22 20:11:15 UTC 2024 x86_64 GNU/Linux

The sole purpose of the VM is to act as a docker host, the OS is running nothing besides docker.
It is running 36 containers including portainer.
The container have various purposes and performance needs, some very low (mrtg, acme) some a little more (freshRSS, phpMyAdmin, alpine-cron, mariadb) and a few with more needs (gitlab, nextcloud, mastodon, transmission, borgmatic, synching).

The issues I am facing started on November 8th, coinciding with the release of CoreOS 41, it was running flawlessly until then.

Since that day the machine is freezing very frequently, almost daily, during nfs write operations. When the issue occurs the machine will run into a softlockup, the process currently using nfs will produce a cpu spike and remain locked (e.g. not possible to kill -9)

Please see below a dmesg and journalctl dump.

dmesg November 17th

Nov 17 15:41:58 coreos kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 78s! [kworker/u16:7:101418]
Nov 17 15:41:58 coreos kernel: CPU#2 Utilization every 4s during lockup:
Nov 17 15:41:59 coreos kernel: #1: 100% system, 0% softirq, 0% hardirq, 0% idle
Nov 17 15:41:59 coreos kernel: #2: 100% system, 1% softirq, 1% hardirq, 0% idle
Nov 17 15:41:59 coreos kernel: #3: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 17 15:41:59 coreos kernel: #4: 100% system, 0% softirq, 0% hardirq, 0% idle
Nov 17 15:41:59 coreos kernel: #5: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 17 15:41:59 coreos kernel: Modules linked in: bluetooth rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace veth xt_nat xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat bridge stp llc nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat nf_tables rfkill sunrpc intel_rapl_msr intel_rapl_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel kvm rapl virtio_balloon i2c_piix4 qxl drm_ttm_helper i2c_smbus ttm joydev nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci overlay erofs netfs loop xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 virtio_net net_failover failover ata_generic pata_acpi nvme_tcp nvme_fabrics nvme_keyring nvme_core serio_raw nvme_auth be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi scsi_dh_rdac scsi_dh_emc scsi_dh_alua
Nov 17 15:41:59 coreos kernel: ip6_tables ip_tables fuse dm_multipath qemu_fw_cfg virtio_console
Nov 17 15:41:59 coreos kernel: CPU: 2 UID: 0 PID: 101418 Comm: kworker/u16:7 Tainted: G L 6.11.5-300.fc41.x86_64 #1
Nov 17 15:41:59 coreos kernel: Tainted: [L]=SOFTLOCKUP
Nov 17 15:41:59 coreos kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Nov 17 15:41:59 coreos kernel: Workqueue: writeback wb_workfn (flush-0:279)
Nov 17 15:41:59 coreos kernel: RIP: 0010:nfs_page_group_unlock+0x28/0x40 [nfs]
Nov 17 15:41:59 coreos kernel: Code: 90 90 66 0f 1f 00 0f 1f 44 00 00 53 48 89 fb 48 8b 7f 50 48 39 df 74 05 e8 65 ff ff ff 48 8d 7b 38 f0 80 63 38 bf 48 8b 43 38 c4 10 75 06 5b c3 cc cc cc cc be 06 00 00 00 5b e9 e2 79 f1 dd
Nov 17 15:41:59 coreos kernel: RSP: 0018:ffffb8d1c9e4f820 EFLAGS: 00000206
Nov 17 15:41:59 coreos kernel: RAX: 0000000000000027 RBX: ffff97ae029a0800 RCX: ffffffffffffffe0
Nov 17 15:42:00 coreos kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff97ae029a0838
Nov 17 15:42:00 coreos kernel: RBP: ffff97ae029a0838 R08: 0000000000000018 R09: 0000000000000000
Nov 17 15:42:00 coreos kernel: R10: 0000000000000000 R11: 000000000003dc8c R12: ffff97ae029a0800
Nov 17 15:42:00 coreos kernel: R13: fffff1c5c01a0000 R14: 0000000000000001 R15: 0000000000000000
Nov 17 15:42:00 coreos kernel: FS: 0000000000000000(0000) GS:ffff97b22a500000(0000) knlGS:0000000000000000
Nov 17 15:42:00 coreos kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 17 15:42:00 coreos kernel: CR2: 00003c020063e000 CR3: 000000028a42a001 CR4: 0000000000370ef0
Nov 17 15:42:00 coreos kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 17 15:42:00 coreos kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 17 15:42:00 coreos kernel: Call Trace:
Nov 17 15:42:00 coreos kernel:
Nov 17 15:42:08 coreos kernel: ? watchdog_timer_fn.cold+0x233/0x311
Nov 17 15:42:08 coreos kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
Nov 17 15:42:11 coreos kernel: ? __hrtimer_run_queues+0x113/0x280
Nov 17 15:42:15 coreos kernel: ? kvm_clock_get_cycles+0x18/0x30
Nov 17 15:42:19 coreos kernel: ? hrtimer_interrupt+0xfa/0x210
Nov 17 15:42:21 coreos kernel: ? __sysvec_apic_timer_interrupt+0x52/0x100
Nov 17 15:42:25 coreos kernel: ? sysvec_apic_timer_interrupt+0x6c/0x90
Nov 17 15:42:44 coreos kernel:
Nov 17 15:43:02 coreos kernel:

journalctl November 17th

[ 8520.270364] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u16:7:101418]
[ 8520.271777] CPU#2 Utilization every 4s during lockup:
[ 8520.272344] #1: 84% system, 16% softirq, 1% hardirq, 0% idle
[ 8520.273036] #2: 98% system, 2% softirq, 1% hardirq, 0% idle
[ 8520.273717] #3: 100% system, 1% softirq, 0% hardirq, 0% idle
[ 8520.274405] #4: 100% system, 0% softirq, 1% hardirq, 0% idle
[ 8520.275130] #5: 100% system, 1% softirq, 1% hardirq, 0% idle
[ 8520.275821] Modules linked in: bluetooth rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace veth xt_nat xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat bridge stp llc nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat nf_tables rfkill sunrpc intel_rapl_msr intel_rapl_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel kvm rapl virtio_balloon i2c_piix4 qxl drm_ttm_helper i2c_smbus ttm joydev nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci overlay erofs netfs loop xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 virtio_net net_failover failover ata_generic pata_acpi nvme_tcp nvme_fabrics nvme_keyring nvme_core serio_raw nvme_auth be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[ 8520.275890] ip6_tables ip_tables fuse dm_multipath qemu_fw_cfg virtio_console
[ 8520.275901] CPU: 2 UID: 0 PID: 101418 Comm: kworker/u16:7 Not tainted 6.11.5-300.fc41.x86_64 #1
[ 8520.275904] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[ 8520.275908] Workqueue: writeback wb_workfn (flush-0:279)
[ 8520.275918] RIP: 0010:nfs_page_group_unlock+0x24/0x40 [nfs]
[ 8520.275955] Code: 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 53 48 89 fb 48 8b 7f 50 48 39 df 74 05 e8 65 ff ff ff 48 8d 7b 38 f0 80 63 38 bf <48> 8b 43 38 f6 c4 10 75 06 5b c3 cc cc cc cc be 06 00 00 00 5b e9
[ 8520.275957] RSP: 0018:ffffb8d1c9e4f820 EFLAGS: 00000206
[ 8520.275961] RAX: 0000000000000001 RBX: ffff97ae029a0800 RCX: ffffffffffffffe0
[ 8520.275963] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff97ae029a0838
[ 8520.275964] RBP: ffff97ae029a0838 R08: 0000000000000018 R09: 0000000000000000
[ 8520.275966] R10: 0000000000000000 R11: 000000000003dc8c R12: ffff97ae029a0800
[ 8520.275967] R13: fffff1c5c01a0000 R14: 0000000000000001 R15: 0000000000000000
[ 8520.275969] FS: 0000000000000000(0000) GS:ffff97b22a500000(0000) knlGS:0000000000000000
[ 8520.275970] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8520.275972] CR2: 00003c020063e000 CR3: 0000000341ad2004 CR4: 0000000000370ef0
[ 8520.275975] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8520.275976] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8520.275978] Call Trace:
[ 8520.275980]
[ 8520.275985] ? watchdog_timer_fn.cold+0x233/0x311
[ 8520.275990] ? __pfx_watchdog_timer_fn+0x10/0x10
[ 8520.275993] ? __hrtimer_run_queues+0x113/0x280
[ 8520.275996] ? kvm_clock_get_cycles+0x18/0x30
[ 8520.275999] ? hrtimer_interrupt+0xfa/0x210
[ 8520.276002] ? __sysvec_apic_timer_interrupt+0x52/0x100
[ 8520.276005] ? sysvec_apic_timer_interrupt+0x6c/0x90
[ 8520.276008]
[ 8520.276009]
[ 8520.276010] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 8520.276014] ? nfs_page_group_unlock+0x24/0x40 [nfs]
[ 8520.276045] nfs_lock_and_join_requests+0x1b3/0x270 [nfs]
[ 8520.276082] nfs_page_async_flush+0x1b/0x1f0 [nfs]
[ 8520.276111] ? __pfx_nfs_writepages_callback+0x10/0x10 [nfs]
[ 8520.276141] nfs_writepages_callback+0x30/0x50 [nfs]
[ 8520.276168] write_cache_pages+0x65/0xc0
[ 8520.276172] nfs_writepages+0x162/0x2e0 [nfs]
[ 8520.276200] ? __dev_queue_xmit+0x274/0xe80
[ 8520.276203] ? selinux_ip_postroute+0x114/0x430
[ 8520.276209] do_writepages+0x7e/0x280
[ 8520.276212] __writeback_single_inode+0x41/0x340
[ 8520.276216] writeback_sb_inodes+0x21c/0x4f0
[ 8520.276223] __writeback_inodes_wb+0x4c/0xf0
[ 8520.276226] wb_writeback+0x193/0x310
[ 8520.276229] wb_workfn+0x2af/0x450
[ 8520.276233] process_one_work+0x176/0x330
[ 8520.276237] worker_thread+0x252/0x390
[ 8520.276240] ? __pfx_worker_thread+0x10/0x10
[ 8520.276243] kthread+0xcf/0x100
[ 8520.276245] ? __pfx_kthread+0x10/0x10
[ 8520.276247] ret_from_fork+0x31/0x50
[ 8520.276250] ? __pfx_kthread+0x10/0x10
[ 8520.276252] ret_from_fork_asm+0x1a/0x30
[ 8520.276256]

journalctl November 21st (new box)

Nov 21 07:32:06 coreos kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 25008s! [borg:491201]
Nov 21 07:32:06 coreos kernel: CPU#2 Utilization every 4s during lockup:
Nov 21 07:32:06 coreos kernel: #1: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 21 07:32:06 coreos kernel: #2: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 21 07:32:06 coreos kernel: #3: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 21 07:32:06 coreos kernel: #4: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 21 07:32:06 coreos kernel: #5: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 21 07:32:06 coreos kernel: Modules linked in: bluetooth rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace xt_nat veth xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat bridge stp llc nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat nf_tables rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel kvm rapl qxl virtio_balloon drm_ttm_helper ttm i2c_piix4 i2c_smbus joydev nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci overlay erofs netfs loop xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 virtio_blk virtio_net net_failover failover nvme_tcp nvme_fabrics ata_generic nvme_keyring pata_acpi nvme_core nvme_auth serio_raw be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
Nov 21 07:32:06 coreos kernel: scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables fuse dm_multipath qemu_fw_cfg virtio_console
Nov 21 07:32:06 coreos kernel: CPU: 2 UID: 0 PID: 491201 Comm: borg Tainted: G L 6.11.5-300.fc41.x86_64 #1
Nov 21 07:32:06 coreos kernel: Tainted: [L]=SOFTLOCKUP
Nov 21 07:32:06 coreos kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Nov 21 07:32:06 coreos kernel: RIP: 0010:nfs_lock_and_join_requests+0x19d/0x270 [nfs]
Nov 21 07:32:06 coreos kernel: Code: 08 48 89 ef e8 54 b7 ff ff 48 8b 6d 48 49 39 ec 75 e8 4c 89 e7 e8 f3 b6 ff ff 48 89 df 49 63 df e8 38 b7 ff ff e9 fd fe ff ff <4d> 8b 64 24 48 4c 39 e3 0f 85 45 ff ff ff 48 89 df e8 8d ab ff ff
Nov 21 07:32:06 coreos kernel: RSP: 0018:ffff9c9d0832f6d8 EFLAGS: 00000246
Nov 21 07:32:06 coreos kernel: RAX: 0000000000000001 RBX: ffff8bd836060200 RCX: 0000000000000001
Nov 21 07:32:06 coreos kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8bd96f644434
Nov 21 07:32:06 coreos kernel: RBP: ffff8bd836060238 R08: 0000000000000000 R09: 000000000000001b
Nov 21 07:32:06 coreos kernel: R10: 0000000000000000 R11: 000000000003dc8c R12: ffff8bd96f644400
Nov 21 07:32:06 coreos kernel: R13: ffffdc3800ddd000 R14: 0000000000000001 R15: 0000000000000000
Nov 21 07:32:06 coreos kernel: FS: 00007f65b7d6bb08(0000) GS:ffff8bdb2fb00000(0000) knlGS:0000000000000000
Nov 21 07:32:06 coreos kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 21 07:32:06 coreos kernel: CR2: 00007f65b0991000 CR3: 00000001256fc006 CR4: 0000000000370ef0
Nov 21 07:32:06 coreos kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 21 07:32:06 coreos kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 21 07:32:06 coreos kernel: Call Trace:
Nov 21 07:32:06 coreos kernel:
Nov 21 07:32:06 coreos kernel: ? watchdog_timer_fn.cold+0x233/0x311
Nov 21 07:32:06 coreos kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
Nov 21 07:32:06 coreos kernel: ? __hrtimer_run_queues+0x113/0x280
Nov 21 07:32:06 coreos kernel: ? kvm_clock_get_cycles+0x18/0x30
Nov 21 07:32:06 coreos kernel: ? hrtimer_interrupt+0xfa/0x210
Nov 21 07:32:06 coreos kernel: ? __sysvec_apic_timer_interrupt+0x52/0x100
Nov 21 07:32:06 coreos kernel: ? sysvec_apic_timer_interrupt+0x6c/0x90
Nov 21 07:32:06 coreos kernel:
Nov 21 07:32:06 coreos kernel:
Nov 21 07:32:06 coreos kernel: ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
Nov 21 07:32:06 coreos kernel: ? nfs_lock_and_join_requests+0x19d/0x270 [nfs]
Nov 21 07:32:06 coreos kernel: ? nfs_lock_and_join_requests+0xcc/0x270 [nfs]
Nov 21 07:32:06 coreos kernel: nfs_page_async_flush+0x1b/0x1f0 [nfs]
Nov 21 07:32:06 coreos kernel: ? __pfx_nfs_writepages_callback+0x10/0x10 [nfs]
Nov 21 07:32:06 coreos kernel: nfs_writepages_callback+0x30/0x50 [nfs]
Nov 21 07:32:06 coreos kernel: write_cache_pages+0x65/0xc0
Nov 21 07:32:06 coreos kernel: nfs_writepages+0x162/0x2e0 [nfs]
Nov 21 07:32:06 coreos kernel: do_writepages+0x7e/0x280
Nov 21 07:32:06 coreos kernel: filemap_fdatawrite_wbc+0x6e/0xa0
Nov 21 07:32:06 coreos kernel: __filemap_fdatawrite_range+0x60/0x90
Nov 21 07:32:06 coreos kernel: file_write_and_wait_range+0x47/0xb0
Nov 21 07:32:06 coreos kernel: nfs_file_fsync+0xa2/0x1c0 [nfs]
Nov 21 07:32:06 coreos kernel: __x64_sys_fdatasync+0x4f/0x90
Nov 21 07:32:06 coreos kernel: do_syscall_64+0x82/0x160
Nov 21 07:32:06 coreos kernel: ? __vm_munmap+0xb9/0x170
Nov 21 07:32:06 coreos kernel: ? syscall_exit_to_user_mode+0x10/0x220
Nov 21 07:32:06 coreos kernel: ? do_syscall_64+0x8e/0x160
Nov 21 07:32:06 coreos kernel: ? do_vmi_align_munmap+0x266/0x550
Nov 21 07:32:06 coreos kernel: ? __vm_munmap+0xb9/0x170
Nov 21 07:32:06 coreos kernel: ? syscall_exit_to_user_mode+0x10/0x220
Nov 21 07:32:06 coreos kernel: ? do_syscall_64+0x8e/0x160
Nov 21 07:32:06 coreos kernel: ? clear_bhb_loop+0x25/0x80
Nov 21 07:32:06 coreos kernel: ? clear_bhb_loop+0x25/0x80
Nov 21 07:32:06 coreos kernel: ? clear_bhb_loop+0x25/0x80
Nov 21 07:32:06 coreos kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
Nov 21 07:32:06 coreos kernel: RIP: 0033:0x7f65b7d28862
Nov 21 07:32:06 coreos kernel: Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 e9 16 cc ff ff 41 54 b8 02 00 00 00 55 48 89 f5 be 00 88 08 00
Nov 21 07:32:06 coreos kernel: RSP: 002b:00007ffe9bc6b658 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
Nov 21 07:32:06 coreos kernel: RAX: ffffffffffffffda RBX: 000000000000004b RCX: 00007f65b7d28862
Nov 21 07:32:06 coreos kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
Nov 21 07:32:06 coreos kernel: RBP: 00007f65b7d6bb08 R08: 0000000000000000 R09: 0000000000000000
Nov 21 07:32:06 coreos kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f65b74c67d0
Nov 21 07:32:06 coreos kernel: R13: 00007f65b7c49d78 R14: 00007f65b75c6480 R15: 00007f65b74c6838
Nov 21 07:32:06 coreos kernel:
Nov 21 07:32:34 coreos kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 25034s! [borg:491201]
Nov 21 07:32:34 coreos kernel: CPU#2 Utilization every 4s during lockup:
Nov 21 07:32:34 coreos kernel: #1: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 21 07:32:34 coreos kernel: #2: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 21 07:32:34 coreos kernel: #3: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 21 07:32:34 coreos kernel: #4: 100% system, 1% softirq, 1% hardirq, 0% idle
Nov 21 07:32:34 coreos kernel: #5: 100% system, 0% softirq, 0% hardirq, 0% idle
Nov 21 07:32:34 coreos kernel: Modules linked in: bluetooth rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace xt_nat veth xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat bridge stp llc nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat nf_tables rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel kvm rapl qxl virtio_balloon drm_ttm_helper ttm i2c_piix4 i2c_smbus joydev nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci overlay erofs netfs loop xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 virtio_blk virtio_net net_failover failover nvme_tcp nvme_fabrics ata_generic nvme_keyring pata_acpi nvme_core nvme_auth serio_raw be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
Nov 21 07:32:34 coreos kernel: scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables fuse dm_multipath qemu_fw_cfg virtio_console
Nov 21 07:32:34 coreos kernel: CPU: 2 UID: 0 PID: 491201 Comm: borg Tainted: G L 6.11.5-300.fc41.x86_64 #1
Nov 21 07:32:34 coreos kernel: Tainted: [L]=SOFTLOCKUP
Nov 21 07:32:34 coreos kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Nov 21 07:32:34 coreos kernel: RIP: 0010:nfs_folio_find_head_request+0x59/0x90 [nfs]
Nov 21 07:32:34 coreos kernel: Code: db 74 1f 48 39 5b 50 75 45 48 8d 7b 34 b8 01 00 00 00 f0 0f c1 43 34 85 c0 74 27 8d 50 01 09 c2 78 14 4c 89 e7 e8 d7 0b d7 d5 <48> 89 d8 5b 5d 41 5c c3 cc cc cc cc be 01 00 00 00 e8 41 17 43 d5
Nov 21 07:32:34 coreos kernel: RSP: 0018:ffff9c9d0832f6b8 EFLAGS: 00000246
Nov 21 07:32:34 coreos kernel: RAX: 0000000000000001 RBX: ffff8bd836060200 RCX: 0000000000000001
Nov 21 07:32:34 coreos kernel: RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8bd9bdac0cbc
Nov 21 07:32:34 coreos kernel: RBP: ffffdc3800ddd000 R08: 0000000000000000 R09: 000000000000001b
Nov 21 07:32:34 coreos kernel: R10: 0000000000000000 R11: 000000000003dc8c R12: ffff8bd9bdac0cbc
Nov 21 07:32:34 coreos kernel: R13: ffffdc3800ddd000 R14: 0000000000000001 R15: 0000000000000000
Nov 21 07:32:34 coreos kernel: FS: 00007f65b7d6bb08(0000) GS:ffff8bdb2fb00000(0000) knlGS:0000000000000000
Nov 21 07:32:34 coreos kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 21 07:32:34 coreos kernel: CR2: 00007f65b0991000 CR3: 00000001256fc006 CR4: 0000000000370ef0
Nov 21 07:32:34 coreos kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 21 07:32:34 coreos kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 21 07:32:34 coreos kernel: Call Trace:
Nov 21 07:32:34 coreos kernel:
Nov 21 07:32:34 coreos kernel: ? watchdog_timer_fn.cold+0x233/0x311
Nov 21 07:32:34 coreos kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
Nov 21 07:32:34 coreos kernel: ? __hrtimer_run_queues+0x113/0x280
Nov 21 07:32:34 coreos kernel: ? kvm_clock_get_cycles+0x18/0x30
Nov 21 07:32:34 coreos kernel: ? hrtimer_interrupt+0xfa/0x210
Nov 21 07:32:34 coreos kernel: ? __sysvec_apic_timer_interrupt+0x52/0x100
Nov 21 07:32:34 coreos kernel: ? sysvec_apic_timer_interrupt+0x6c/0x90
Nov 21 07:32:34 coreos kernel:
Nov 21 07:32:34 coreos kernel:
Nov 21 07:32:34 coreos kernel: ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
Nov 21 07:32:34 coreos kernel: ? nfs_folio_find_head_request+0x59/0x90 [nfs]
Nov 21 07:32:34 coreos kernel: ? nfs_folio_find_head_request+0x59/0x90 [nfs]
Nov 21 07:32:34 coreos kernel: nfs_lock_and_join_requests+0x64/0x270 [nfs]
Nov 21 07:32:34 coreos kernel: nfs_page_async_flush+0x1b/0x1f0 [nfs]
Nov 21 07:32:34 coreos kernel: ? __pfx_nfs_writepages_callback+0x10/0x10 [nfs]
Nov 21 07:32:34 coreos kernel: nfs_writepages_callback+0x30/0x50 [nfs]
Nov 21 07:32:34 coreos kernel: write_cache_pages+0x65/0xc0
Nov 21 07:32:34 coreos kernel: nfs_writepages+0x162/0x2e0 [nfs]
Nov 21 07:32:34 coreos kernel: do_writepages+0x7e/0x280
Nov 21 07:32:34 coreos kernel: filemap_fdatawrite_wbc+0x6e/0xa0
Nov 21 07:32:34 coreos kernel: __filemap_fdatawrite_range+0x60/0x90
Nov 21 07:32:34 coreos kernel: file_write_and_wait_range+0x47/0xb0
Nov 21 07:32:34 coreos kernel: nfs_file_fsync+0xa2/0x1c0 [nfs]
Nov 21 07:32:34 coreos kernel: __x64_sys_fdatasync+0x4f/0x90
Nov 21 07:32:34 coreos kernel: do_syscall_64+0x82/0x160
Nov 21 07:32:34 coreos kernel: ? __vm_munmap+0xb9/0x170
Nov 21 07:32:34 coreos kernel: ? syscall_exit_to_user_mode+0x10/0x220
Nov 21 07:32:34 coreos kernel: ? do_syscall_64+0x8e/0x160
Nov 21 07:32:34 coreos kernel: ? do_vmi_align_munmap+0x266/0x550
Nov 21 07:32:34 coreos kernel: ? __vm_munmap+0xb9/0x170
Nov 21 07:32:34 coreos kernel: ? syscall_exit_to_user_mode+0x10/0x220
Nov 21 07:32:34 coreos kernel: ? do_syscall_64+0x8e/0x160
Nov 21 07:32:34 coreos kernel: ? clear_bhb_loop+0x25/0x80
Nov 21 07:32:34 coreos kernel: ? clear_bhb_loop+0x25/0x80
Nov 21 07:32:34 coreos kernel: ? clear_bhb_loop+0x25/0x80
Nov 21 07:32:34 coreos kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
Nov 21 07:32:34 coreos kernel: RIP: 0033:0x7f65b7d28862
Nov 21 07:32:34 coreos kernel: Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 e9 16 cc ff ff 41 54 b8 02 00 00 00 55 48 89 f5 be 00 88 08 00
Nov 21 07:32:34 coreos kernel: RSP: 002b:00007ffe9bc6b658 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
Nov 21 07:32:34 coreos kernel: RAX: ffffffffffffffda RBX: 000000000000004b RCX: 00007f65b7d28862
Nov 21 07:32:34 coreos kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
Nov 21 07:32:34 coreos kernel: RBP: 00007f65b7d6bb08 R08: 0000000000000000 R09: 0000000000000000
Nov 21 07:32:34 coreos kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f65b74c67d0
Nov 21 07:32:34 coreos kernel: R13: 00007f65b7c49d78 R14: 00007f65b75c6480 R15: 00007f65b74c6838
Nov 21 07:32:34 coreos kernel:

I have searched around the issue and found one other user reporting it. I have contacted that user who informed me that a particular commit in the 6.11 kernel branch was causing the issue and that they are now running production with that commit removed with no issue.

Now I recon this is not a pure coreOS issue, however as the OS is not basing on kernel 6.11 which has the issue it renders my coreOS installation pretty much unusable.

I have spun a second coreOS box and transferred all my workloads to it, it did not take 2 days for the issue to re-occur so while I can’t reproduce it in the sense that I can’t “do something” that will trigger it, it can definitely be reproduced on a fresh installation.

I have 2 questions for the community:

  1. Should this issue be reported as a coreOS bug given it was reported as a kernel bug already (219508 – commit b571cfcb9dcac187c6d967987792d37cb0688610 introduces CPU soft lockup on nfs write operations on virtual machines) ?

  2. Is there a way to roll back to an older kernel on coreOS, say 6.10.12 from 40.20241006.3.0 ? If so, how ?

Thanks a lot in advance for any insights !

Thanks for the report! Can you paste all of that into an issue in GitHub - coreos/fedora-coreos-tracker: Issue tracker for Fedora CoreOS ? Thanks

The kernel’s bugzilla is often not look at by kernel developers so the best way to get this to the attention of someone is to file a bug for the kernel on Fedora’s bugzilla.

You can override the kernel to use a previous one with rpm-ostree override replace .... See: Client administration | rpm-ostree. You can find previous kernel builds from Fedora Updates System.

The issue has been opened and is availabe here : kernel: watchdog: BUG: soft lockup - CPU - CoreOS 41 CPU lockup on nfs activity · Issue #1841 · coreos/fedora-coreos-tracker

Thanks for the hint on how to use a different kernel version, I will definitely give it a try.
Is that override going to be permanent until I revoke it ? E.g. new versions of coreOS won’t overwrite it again ?

1 Like

Yes, it will remain permanent until you reset it, see Resetting overrides.

1 Like

You may also find these instructions useful and adopt them for your use case.

https://fedoraproject.org/wiki/Test_Day:2024-07-28_Kernel_6.10_Test_Week

1 Like

Quick update on this one.
I have been running stable since November 21st as I switched to kernel 6.10.12-200.fc40.

Didn’t think about posting this earlier, but here was my track record before, and this is on the freshly built box :

core@coreos:~$ last reboot

reboot system boot 6.10.12-200.fc40 Tue Nov 26 14:47 still running
reboot system boot 6.10.12-200.fc40 Thu Nov 21 12:46 - 14:46 (5+02:00)
reboot system boot 6.11.5-300.fc41. Thu Nov 21 08:34 - 12:46 (04:11)
reboot system boot 6.11.5-300.fc41. Wed Nov 20 09:48 - crash (22:45)
reboot system boot 6.11.5-300.fc41. Wed Nov 20 09:28 - 09:48 (00:20)
reboot system boot 6.11.5-300.fc41. Tue Nov 19 21:51 - crash (11:37)
reboot system boot 6.11.5-300.fc41. Tue Nov 19 18:18 - crash (03:32)
reboot system boot 6.11.5-300.fc41. Tue Nov 19 18:16 - crash (00:01)
reboot system boot 6.11.5-300.fc41. Tue Nov 19 17:14 - crash (01:02)
reboot system boot 6.11.5-300.fc41. Tue Nov 19 18:11 - 17:14 (-00:56)
reboot system boot 6.11.5-300.fc41. Tue Nov 19 18:09 - 17:14 (-00:55)
reboot system boot 6.11.5-300.fc41. Tue Nov 19 15:32 - 17:08 (01:36)
reboot system boot 6.11.5-300.fc41. Tue Nov 19 15:02 - crash (00:29)

I will now try to switch to an upstream version and see how it behaves. Will keep you posted.

It unfortunately didn’t take long at all, the issue is also there in upstream 6.12
kernel-6.12.0-65.fc42

duly reported on github too, just keeping you informed here as well

dmesg kernel-6.12.0-65.fc42

[474.898009] br-ff4b6fc15e1b: port 1(veth49fe0d5) entered forwarding state
[15934.071816] hrtimer: interrupt took 2744199 ns
[27395.991743] watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [kworker/u34:9:238078]
[27395.991781] CPU#6 Utilization every 4s during lockup:
[27395.991788] #1: 100% system, 0% softirq, 1% hardirq, 0% idle
[27395.991797] #2: 100% system, 0% softirq, 1% hardirq, 0% idle
[27395.991805] #3: 100% system, 0% softirq, 1% hardirq, 0% idle
[27395.991813] #4: 100% system, 0% softirq, 1% hardirq, 0% idle
[27395.991821] #5: 100% system, 0% softirq, 0% hardirq, 0% idle
[27395.991831] Modules linked in: bluetooth rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver xt_nat nfs lockd grace nfs_localio veth xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat bridge stp llc nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat nf_tables rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel kvm rapl virtio_balloon qxl drm_ttm_helper i2c_piix4 ttm i2c_smbus joydev nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci overlay erofs netfs loop xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 virtio_blk nvme_tcp virtio_net nvme_fabrics net_failover failover nvme_keyring ata_generic pata_acpi nvme_core nvme_auth serio_raw be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi
[27395.991906] scsi_transport_iscsi scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables fuse dm_multipath qemu_fw_cfg virtio_console
[27395.991922] CPU: 6 UID: 0 PID: 238078 Comm: kworker/u34:9 Not tainted 6.12.0-65.fc42.x86_64 #1
[27395.991925] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[27395.991929] Workqueue: writeback wb_workfn (flush-0:326)
[27395.991939] RIP: 0010:_raw_spin_lock+0x1b/0x30
[27395.991943] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 65 ff 05 88 73 e1 50 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 05 c3 cc cc cc cc 89 c6 e8 d7 04 00 00 90 c3 cc cc cc cc 90 90
[27395.991945] RSP: 0018:ffffb6e2c1b73800 EFLAGS: 00000246
[27395.991947] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[27395.991948] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9c7659df3e94
[27395.991950] RBP: fffffb4745409e00 R08: 0000000000000000 R09: 0000000000000000
[27395.991951] R10: 0000000000000000 R11: ffff9c797fd217c0 R12: ffff9c7659df3e94
[27395.991952] R13: fffffb4745409e00 R14: 0000000000000001 R15: 0000000000000000
[27395.991954] FS: 0000000000000000(0000) GS:ffff9c797fd00000(0000) knlGS:0000000000000000
[27395.991956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[27395.991957] CR2: 00007fd9dd3c7000 CR3: 000000043782a004 CR4: 0000000000372ef0
[27395.991961] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[27395.991962] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[27395.991965] Call Trace:
[27395.991967]
[27395.991970] ? watchdog_timer_fn.cold+0x233/0x311
[27395.991975] ? __pfx_watchdog_timer_fn+0x10/0x10
[27395.991978] ? __hrtimer_run_queues+0x113/0x280
[27395.991981] ? kvm_clock_get_cycles+0x18/0x30
[27395.991984] ? hrtimer_interrupt+0xfa/0x210
[27395.991986] ? __sysvec_apic_timer_interrupt+0x52/0x100
[27395.991989] ? sysvec_apic_timer_interrupt+0x6c/0x90
[27395.991991]
[27395.991992]
[27395.991993] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[27395.991997] ? _raw_spin_lock+0x1b/0x30
[27395.992000] nfs_folio_find_head_request+0x29/0x90 [nfs]
[27395.992040] nfs_lock_and_join_requests+0x64/0x270 [nfs]
[27395.992072] nfs_page_async_flush+0x1b/0x1f0 [nfs]
[27395.992100] ? __pfx_nfs_writepages_callback+0x10/0x10 [nfs]
[27395.992128] nfs_writepages_callback+0x30/0x50 [nfs]
[27395.992155] write_cache_pages+0x65/0xc0
[27395.992159] nfs_writepages+0x162/0x2e0 [nfs]
[27395.992190] do_writepages+0x7e/0x280
[27395.992193] ? __mark_inode_dirty+0x87/0x350
[27395.992196] __writeback_single_inode+0x41/0x340
[27395.992199] writeback_sb_inodes+0x21d/0x4e0
[27395.992207] __writeback_inodes_wb+0x4c/0xf0
[27395.992210] wb_writeback+0x193/0x310
[27395.992214] wb_workfn+0x355/0x450
[27395.992218] process_one_work+0x176/0x330
[27395.992222] worker_thread+0x252/0x390
[27395.992225] ? __pfx_worker_thread+0x10/0x10
[27395.992227] kthread+0xcf/0x100
[27395.992230] ? __pfx_kthread+0x10/0x10
[27395.992232] ret_from_fork+0x31/0x50
[27395.992234] ? __pfx_kthread+0x10/0x10
[27395.992236] ret_from_fork_asm+0x1a/0x30
[27395.992240]