VCPU hotplug failure for a KVM guest leads to guest termination

Problem

If hotplugging a VCPU into a KVM guest running in an LPAR fails, the KVM guest terminates abruptly.

The bug will be observed in the following scenario:

  1. The maximum number of VCPUs specified for the guest (e.g. 128) is greater than the current number of VCPUs (e.g. 4).
  2. The user attempts to hotplug VCPUs as follows:
     virsh setvcpus <guest_name> 68
  3. The following error is reported:
     KVM: Create Guest vcpu hcall failed, rc=-44
     error: Unable to read from monitor: Connection reset by peer

Cause

During a vCPU hotplug operation for a running qemu-kvm guest in a ppc64 LPAR, KVM requests the required number of vCPUs from the PowerVM hypervisor. PowerVM must acquire resources to satisfy this request, and that acquisition can fail with either a transient or a non-transient error. The QEMU instance running in the LPAR treats the failure as fatal and terminates the running guest.

This problem affects qemu-kvm guests on all supported architectures that do not preallocate vCPUs. However, it disproportionately impacts guests in a PowerVM LPAR, because PowerVM’s KVM resources are limited and shared across multiple LPARs. As a result, even a transient vCPU hotplug failure can terminate a running KVM guest.

Related Issues

Bugzilla report: #2304078

Workarounds

A fix is available in upstream QEMU (linked in the Bugzilla report) and will be available in Fedora 40 soon. In the meantime, one of the following workarounds can be used to avoid the issue:

  1. Set the current number of VCPUs equal to the maximum number of VCPUs in the XML config file for the guest. This forces QEMU to allocate all the VCPUs when starting the guest. For example, if the user needs 128 VCPUs, the tag should be:
     <vcpu placement='static' current='128'>128</vcpu>
  2. Avoid hotplugging VCPUs for the running KVM guest.
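
To check whether a guest is configured as described in the first workaround, the `<vcpu>` element of its domain XML (for example, the output of `virsh dumpxml <guest_name>`) can be inspected. The following is only a minimal sketch: the function names `vcpu_counts` and `starts_with_all_vcpus` are made up for illustration, and it assumes that `<vcpu>` is a direct child of the `<domain>` element and that a missing `current` attribute means all VCPUs are enabled at startup.

```python
import xml.etree.ElementTree as ET


def vcpu_counts(domain_xml: str) -> tuple[int, int]:
    """Return (current, maximum) VCPU counts from a libvirt domain XML string."""
    root = ET.fromstring(domain_xml)
    vcpu = root.find("vcpu")  # assumed to be a direct child of <domain>
    if vcpu is None or vcpu.text is None:
        raise ValueError("domain XML has no usable <vcpu> element")
    maximum = int(vcpu.text.strip())
    # Assumption: without a 'current' attribute, all VCPUs are enabled at boot.
    current = int(vcpu.get("current", maximum))
    return current, maximum


def starts_with_all_vcpus(domain_xml: str) -> bool:
    """True if the guest boots with every VCPU enabled (no hotplug needed)."""
    current, maximum = vcpu_counts(domain_xml)
    return current == maximum


if __name__ == "__main__":
    # Hypothetical guest definition matching the scenario in this report.
    example = (
        "<domain type='kvm'><name>guest</name>"
        "<vcpu placement='static' current='4'>128</vcpu></domain>"
    )
    print(vcpu_counts(example))
    print(starts_with_all_vcpus(example))
```

A guest for which `starts_with_all_vcpus` returns false still has room for VCPU hotplug and is therefore exposed to this issue until the fix lands.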

Thanks for the writeup. I currently think that this doesn’t affect a large enough portion of our userbase to be included in Common Issues. It is a very concrete and specialized bug that most people won’t hit. But you can try to convince me :)
