Occasional freezes on Linux guests (QEMU/KVM)

Recently, my virtual machines have occasionally been freezing, either during boot or during normal operation. Sometimes the freeze is so bad that it lags the host and I have to destroy the running VM. Any clue why this is happening? I checked the logs (journalctl -b 0) and nothing looks out of the ordinary.
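
In case it helps, this is roughly how I filtered the journal (kernel messages at warning priority and above for the current boot):

journalctl -b 0 -k -p warning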

The VMs are both Linux-based (Ubuntu 22.04 server, Debian 11).

Edit: The Ubuntu VM has some possibly useful logs:

[  471.737097] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/2:2:1194]
[  471.737136] Modules linked in: xt_conntrack xt_MASQUERADE nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter intel_rapl_msr intel_rapl_common kvm_intel kvm rapl snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep input_leds snd_pcm serio_raw snd_timer snd soundcore joydev mac_hid qemu_fw_cfg binfmt_misc sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua overlay iptable_filter ip6table_filter ip6_tables br_netfilter bridge ramoops stp llc pstore_blk reed_solomon arp_tables pstore_zone efi_pstore virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xhci_pci qxl virtio_net drm_ttm_helper aesni_intel ttm net_failover drm_kms_helper
[  471.737207]  crypto_simd syscopyarea sysfillrect sysimgblt ahci fb_sys_fops cec i2c_i801 rc_core cryptd drm psmouse libahci failover i2c_smbus lpc_ich virtio_blk xhci_pci_renesas
[  471.737222] CPU: 2 PID: 1194 Comm: kworker/2:2 Not tainted 5.15.0-58-generic #64-Ubuntu
[  471.737225] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc37 04/01/2014
[  471.737228] Workqueue: events drm_fb_helper_damage_work [drm_kms_helper]
[  471.737273] RIP: 0010:smp_call_function_many_cond+0x13c/0x360
[  471.737277] Code: 02 41 89 c4 73 2e 4d 63 ec 48 8b 0b 49 81 fd ff 1f 00 00 0f 87 e4 01 00 00 4a 03 0c ed e0 ca ae a8 8b 41 08 a8 01 74 0a f3 90 <8b> 51 08 83 e2 01 75 f6 eb bb 48 83 c4 40 5b 41 5c 41 5d 41 5e 41
[  471.737279] RSP: 0018:ffffb1e6c00d3ad8 EFLAGS: 00000202
[  471.737280] RAX: 0000000000000011 RBX: ffff8c067dd31bc0 RCX: ffff8c067dcb76e0
[  471.737282] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8c060105f460
[  471.737282] RBP: ffffb1e6c00d3b40 R08: 0000000000000000 R09: 0000000000000000
[  471.737283] R10: 0000000000000001 R11: fffffffffffffffe R12: 0000000000000001
[  471.737284] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000003
[  471.737285] FS:  0000000000000000(0000) GS:ffff8c067dd00000(0000) knlGS:0000000000000000
[  471.737286] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  471.737288] CR2: 000055e5259b5800 CR3: 000000000b6d2005 CR4: 0000000000370ee0
[  471.737291] Call Trace:
[  471.737294]  <TASK>
[  471.737299]  ? invalidate_user_asid+0x30/0x30
[  471.737303]  on_each_cpu_cond_mask+0x1d/0x30
[  471.737305]  flush_tlb_kernel_range+0x41/0xa0
[  471.737307]  __purge_vmap_area_lazy+0xbd/0x6f0
[  471.737310]  ? free_vmap_area_noflush+0x28e/0x310
[  471.737312]  ? kmem_cache_free+0x24f/0x290
[  471.737315]  free_vmap_area_noflush+0x2c7/0x310
[  471.737317]  remove_vm_area+0xa5/0xc0
[  471.737320]  __vunmap+0x93/0x260
[  471.737322]  vunmap+0x30/0x40
[  471.737324]  ttm_bo_vunmap+0xc3/0xe0 [ttm]
[  471.737333]  ? qxl_image_init+0x43/0x50 [qxl]
[  471.737337]  qxl_bo_vunmap_locked+0x38/0x40 [qxl]
[  471.737340]  qxl_draw_dirty_fb+0x28b/0x430 [qxl]
[  471.737344]  qxl_framebuffer_surface_dirty+0xf8/0x1d0 [qxl]
[  471.737347]  drm_fb_helper_damage_work+0xc2/0x200 [drm_kms_helper]
[  471.737359]  process_one_work+0x228/0x3d0
[  471.737361]  worker_thread+0x53/0x420
[  471.737362]  ? process_one_work+0x3d0/0x3d0
[  471.737363]  kthread+0x127/0x150
[  471.737365]  ? set_kthread_struct+0x50/0x50
[  471.737367]  ret_from_fork+0x1f/0x30
[  471.737371]  </TASK>
[  499.735338] watchdog: BUG: soft lockup - CPU#2 stuck for 52s! [kworker/2:2:1194]
[  499.735374] Modules linked in: xt_conntrack xt_MASQUERADE nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter intel_rapl_msr intel_rapl_common kvm_intel kvm rapl snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep input_leds snd_pcm serio_raw snd_timer snd soundcore joydev mac_hid qemu_fw_cfg binfmt_misc sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua overlay iptable_filter ip6table_filter ip6_tables br_netfilter bridge ramoops stp llc pstore_blk reed_solomon arp_tables pstore_zone efi_pstore virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xhci_pci qxl virtio_net drm_ttm_helper aesni_intel ttm net_failover drm_kms_helper
[  499.735435]  crypto_simd syscopyarea sysfillrect sysimgblt ahci fb_sys_fops cec i2c_i801 rc_core cryptd drm psmouse libahci failover i2c_smbus lpc_ich virtio_blk xhci_pci_renesas
[  499.735448] CPU: 2 PID: 1194 Comm: kworker/2:2 Tainted: G             L    5.15.0-58-generic #64-Ubuntu
[  499.735451] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc37 04/01/2014
[  499.735453] Workqueue: events drm_fb_helper_damage_work [drm_kms_helper]
[  499.735473] RIP: 0010:smp_call_function_many_cond+0x13c/0x360
[  499.735478] Code: 02 41 89 c4 73 2e 4d 63 ec 48 8b 0b 49 81 fd ff 1f 00 00 0f 87 e4 01 00 00 4a 03 0c ed e0 ca ae a8 8b 41 08 a8 01 74 0a f3 90 <8b> 51 08 83 e2 01 75 f6 eb bb 48 83 c4 40 5b 41 5c 41 5d 41 5e 41
[  499.735479] RSP: 0018:ffffb1e6c00d3ad8 EFLAGS: 00000202
[  499.735481] RAX: 0000000000000011 RBX: ffff8c067dd31bc0 RCX: ffff8c067dcb76e0
[  499.735483] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8c060105f460
[  499.735484] RBP: ffffb1e6c00d3b40 R08: 0000000000000000 R09: 0000000000000000
[  499.735485] R10: 0000000000000001 R11: fffffffffffffffe R12: 0000000000000001
[  499.735486] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000003
[  499.735487] FS:  0000000000000000(0000) GS:ffff8c067dd00000(0000) knlGS:0000000000000000
[  499.735489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  499.735490] CR2: 000055e5259b5800 CR3: 000000000b6d2005 CR4: 0000000000370ee0
[  499.735495] Call Trace:
[  499.735496]  <TASK>
[  499.735499]  ? invalidate_user_asid+0x30/0x30
[  499.735504]  on_each_cpu_cond_mask+0x1d/0x30
[  499.735507]  flush_tlb_kernel_range+0x41/0xa0
[  499.735509]  __purge_vmap_area_lazy+0xbd/0x6f0
[  499.735513]  ? free_vmap_area_noflush+0x28e/0x310
[  499.735515]  ? kmem_cache_free+0x24f/0x290
[  499.735519]  free_vmap_area_noflush+0x2c7/0x310
[  499.735522]  remove_vm_area+0xa5/0xc0
[  499.735525]  __vunmap+0x93/0x260
[  499.735528]  vunmap+0x30/0x40
[  499.735531]  ttm_bo_vunmap+0xc3/0xe0 [ttm]
[  499.735537]  ? qxl_image_init+0x43/0x50 [qxl]
[  499.735541]  qxl_bo_vunmap_locked+0x38/0x40 [qxl]
[  499.735545]  qxl_draw_dirty_fb+0x28b/0x430 [qxl]
[  499.735550]  qxl_framebuffer_surface_dirty+0xf8/0x1d0 [qxl]
[  499.735555]  drm_fb_helper_damage_work+0xc2/0x200 [drm_kms_helper]
[  499.735569]  process_one_work+0x228/0x3d0
[  499.735572]  worker_thread+0x53/0x420
[  499.735573]  ? process_one_work+0x3d0/0x3d0
[  499.735575]  kthread+0x127/0x150
[  499.735577]  ? set_kthread_struct+0x50/0x50
[  499.735580]  ret_from_fork+0x1f/0x30
[  499.735585]  </TASK>
[  512.618159] rcu: INFO: rcu_sched self-detected stall on CPU
[  512.618215] rcu:     1-...!: (1 ticks this GP) idle=bcd/1/0x4000000000000000 softirq=13170/13170 fqs=100 
[  512.618254]  (t=16857 jiffies g=5693 q=188)
[  512.618256] rcu: rcu_sched kthread timer wakeup didn't happen for 16634 jiffies! g5693 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  512.618294] rcu:     Possible timer handling issue on cpu=0 timer-softirq=2137
[  512.618318] rcu: rcu_sched kthread starved for 16635 jiffies! g5693 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[  512.618378] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[  512.618411] rcu: RCU grace-period kthread stack dump:
[  512.618430] task:rcu_sched       state:I stack:    0 pid:   14 ppid:     2 flags:0x00004000
[  512.618435] Call Trace:
[  512.618439]  <TASK>
[  512.618445]  __schedule+0x24e/0x590
[  512.618457]  schedule+0x69/0x110
[  512.618459]  schedule_timeout+0x87/0x140
[  512.618463]  ? __bpf_trace_tick_stop+0x20/0x20
[  512.618468]  rcu_gp_fqs_loop+0xe5/0x330
[  512.618474]  rcu_gp_kthread+0xa7/0x130
[  512.618477]  ? rcu_gp_init+0x5f0/0x5f0
[  512.618480]  kthread+0x127/0x150
[  512.618485]  ? set_kthread_struct+0x50/0x50
[  512.618488]  ret_from_fork+0x1f/0x30
[  512.618497]  </TASK>
[  512.618498] rcu: Stack dump where RCU GP kthread last ran:
[  512.618520] Sending NMI from CPU 1 to CPUs 0:
[  512.618948] NMI backtrace for cpu 0 skipped: idling at native_safe_halt+0xb/0x10

Summarized host specs:

CPU: quad core Intel Core i7-8565U (-MCP-) speed/min/max: 1450/400/4600 MHz
Kernel: 6.1.8-200.fc37.x86_64 x86_64 Up: 12h 47m Mem: 2643.9/7822.1 MiB
(33.8%) Storage: 1.14 TiB (14.2% used) Procs: 287 Shell: Bash inxi: 3.3.24

This kind of issue is typically related to power management or memory allocation and swapping.


My RAM usage is generally below 3GB, and the Ubuntu server VM has 2GB allocated; the total is ~5-6GB out of 8GB, so that probably isn’t the problem.

To add insult to injury, the host’s terminals sometimes hang during the lag spike.

I set the CPU governor to performance and still got the same issue.
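
For reference, I switched the governor via sysfs, roughly like this:

echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor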

If you only have 8GB of RAM, with half of it allocated to zram swap, that leaves only ~4GB for the system. Using 3GB for the system and 2GB for the VM means you have allocated more than is available.

When the issue happens, analyze the output of:

powerprofilesctl get; free -m; swapon -s; zramctl
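
If the freeze makes it hard to type at the right moment, you could leave something like this running in a spare terminal beforehand:

watch -n 5 'free -m; zramctl'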


Here it goes during the lag spike:

balanced
               total        used        free      shared  buff/cache   available
Mem:            7822        3540         431         520        3849        3460
Swap:           7821           2        7819
Filename                                Type            Size            Used            Priority
/dev/zram0                              partition       8009724         2560            100
NAME       ALGORITHM DISKSIZE DATA  COMPR TOTAL STREAMS MOUNTPOINT
/dev/zram0 lzo-rle       7.6G   2M 192.5K  532K       4 [SWAP]

I used to run 2 VMs, hitting almost 7GB of RAM usage, without problems. During the lag spike the vCPUs hit 100% usage on the host even though RAM usage is below 50%.
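
I’m watching the QEMU processes on the host with something like:

top -p $(pgrep -d, -f qemu-system)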


Try installing and booting the earliest available kernel:

sudo dnf downgrade kernel
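
Then select the older kernel in the GRUB menu at boot, or make it the default with grubby (substitute the version you actually installed):

sudo grubby --set-default /boot/vmlinuz-<version>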

Also check for packet loss when the issue happens:

sudo dnf install mtr
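
Then run it in report mode against a stable target while the issue is happening, for example:

mtr -rw -c 100 8.8.8.8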