GPU driver unbind causes freeze and 100% CPU core load

I’m having trouble unbinding the GPU driver on my notebook, which runs Fedora 30 with a working Bumblebee installation. Unbinding used to work fine in this configuration, and I can’t remember changing anything related to Bumblebee, my kernel parameters, or my GPU drivers since.

I’m not sure when the issue started because I only just noticed it. It could easily have been there for over 3 months.

Here’s some info on my system (kernel params, lspci output): system info · GitHub

My issue is that as soon as I run the following:

sudo optirun bash -c "echo '0000:01:00.0' > '/sys/bus/pci/devices/0000:01:00.0/driver/unbind'"

the bash process never exits. As soon as I hit Ctrl+C to cancel, the process starts using 100% of the CPU core it is running on. Killing the process has no effect. Once that happens I can’t even shut down normally anymore; I have to hold the power button for 5 seconds to force the machine off.


(The process shows up at about 12% total CPU usage because I have 8 cores and 100/8 ≈ 12.)

I think this used to work just fine when run without optirun:

sudo bash -c "echo '0000:01:00.0' > '/sys/bus/pci/devices/0000:01:00.0/driver/unbind'"

This now results in the following though:

bash: /sys/bus/pci/devices/0000:01:00.0/driver/unbind: No such file or directory
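
The missing path would be consistent with no driver being bound to the device at that moment: the `driver` entry under a PCI device in sysfs is a symlink that only exists while a driver is bound. Whether that is really what happens here is an assumption. A minimal sketch of a check that could run before the unbind; `bound_driver` and the `SYSFS_ROOT` override are hypothetical names used only for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of the driver bound to a PCI device,
# or return 1 if no driver is bound (the "driver" symlink is absent).
# SYSFS_ROOT is overridable so the function can be tried on a fake tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

bound_driver() {
    local addr="$1"
    local link="${SYSFS_ROOT}/bus/pci/devices/${addr}/driver"
    [ -L "${link}" ] || return 1
    # The symlink target ends in the driver name, e.g. .../drivers/nvidia
    basename "$(readlink "${link}")"
}

if drv=$(bound_driver "0000:01:00.0"); then
    echo "bound to: ${drv}"
else
    echo "no driver bound"
fi
```

If this reports that no driver is bound, there is nothing to unbind and the write to `driver/unbind` can be skipped.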

After running the command with optirun (which, as I said, makes the process “freeze”), the following lines were added to my dmesg output:

[ 4922.022633] bbswitch: enabling discrete graphics
[ 4922.525022] IPMI message handler: version 39.2
[ 4922.526962] ipmi device interface
[ 4922.579341] nvidia: module license 'NVIDIA' taints kernel.
[ 4922.579344] Disabling lock debugging due to kernel taint
[ 4922.591888] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[ 4922.794587] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  440.44  Sun Dec  8 03:38:56 UTC 2019
[ 4922.809585] nvidia-uvm: Loaded the UVM driver, major device number 235.
[ 4923.759096] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  440.44  Sun Dec  8 03:29:48 UTC 2019
[ 4923.957104] NVRM: Attempting to remove minor device 0 with non-zero usage count!
[ 4923.957106] ------------[ cut here ]------------
[ 4923.957258] WARNING: CPU: 1 PID: 3940 at /tmp/akmodsbuild.hOemv1NH/BUILD/nvidia-kmod-440.44/_kmod_build_5.3.15-200.fc30.x86_64/nvidia/nv-pci.c:560 nv_pci_remove+0x343/0x370 [nvidia]
[ 4923.957261] Modules linked in: nvidia_modeset(POE) nvidia_uvm(OE) nvidia(POE) ipmi_devintf ipmi_msghandler ccm xt_nat veth nf_conntrack_netlink xt_addrtype br_netfilter rfcomm xt_CHECKSUM xt_MASQUERADE tun bridge stp llc ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set overlay nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter cmac bnep bbswitch(OE) sunrpc vfat fat intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp iwlmvm coretemp kvm_intel mac80211 uvcvideo kvm raid0 snd_hda_codec_realtek libarc4 snd_hda_codec_generic snd_hda_codec_hdmi ledtrig_audio videobuf2_vmalloc snd_hda_intel iwlwifi btusb videobuf2_memops snd_soc_rt5640 irqbypass iTCO_wdt snd_hda_codec btrtl intel_cstate mei_hdcp btbcm intel_uncore
[ 4923.957290]  videobuf2_v4l2 btintel videobuf2_common iTCO_vendor_support snd_soc_rl6231 bluetooth snd_soc_core videodev intel_rapl_perf cfg80211 snd_hda_core asus_wmi snd_compress input_polldev sparse_keymap snd_hwdep ac97_bus snd_pcm_dmaengine snd_seq i2c_i801 mc acpi_als rtsx_pci_ms ecdh_generic snd_seq_device lpc_ich kfifo_buf mei_me memstick rfkill ecc mei snd_pcm industrialio snd_timer snd soundcore acpi_pad ip_tables dm_crypt i915 rtsx_pci_sdmmc mmc_core crct10dif_pclmul i2c_algo_bit crc32_pclmul drm_kms_helper crc32c_intel mxm_wmi drm ghash_clmulni_intel serio_raw rtsx_pci r8169 video wmi fuse
[ 4923.957313] CPU: 1 PID: 3940 Comm: bash Tainted: P        W  OE     5.3.15-200.fc30.x86_64 #1
[ 4923.957314] Hardware name: GIGABYTE P35V4/P35V4, BIOS FD0B 11/06/2017
[ 4923.957470] RIP: 0010:nv_pci_remove+0x343/0x370 [nvidia]
[ 4923.957473] Code: 4c 0b c3 eb 9f 41 8b 94 24 70 04 00 00 48 c7 c6 70 34 1b c2 bf 04 00 00 00 e8 89 86 00 00 48 c7 c7 b8 34 1b c2 e8 5b 4a 0c c3 <0f> 0b e8 d6 8c 00 00 eb f9 4c 89 e6 48 89 ef e8 59 7b 75 00 e9 23
[ 4923.957475] RSP: 0018:ffffb9c40aa37dd8 EFLAGS: 00010246
[ 4923.957477] RAX: 0000000000000024 RBX: ffff9aef53a8a000 RCX: 0000000000000006
[ 4923.957478] RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff9aef56a57900
[ 4923.957479] RBP: ffff9aed4b7bb008 R08: ffffb9c40aa37c95 R09: 00000000000004e1
[ 4923.957481] R10: ffffb9c40aa37c90 R11: ffffb9c40aa37c95 R12: ffff9aeef9796000
[ 4923.957482] R13: ffff9aef53a8a000 R14: ffffb9c40aa37f00 R15: ffff9aef4a8bcaa0
[ 4923.957484] FS:  00007f4d9b03b740(0000) GS:ffff9aef56a40000(0000) knlGS:0000000000000000
[ 4923.957486] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4923.957487] CR2: 0000561908dc38c8 CR3: 00000002197e0004 CR4: 00000000003606e0
[ 4923.957488] Call Trace:
[ 4923.957497]  pci_device_remove+0x3b/0xa0
[ 4923.957501]  device_release_driver_internal+0xd8/0x1b0
[ 4923.957504]  unbind_store+0xef/0x120
[ 4923.957508]  kernfs_fop_write+0x10e/0x190
[ 4923.957511]  vfs_write+0xb6/0x1a0
[ 4923.957514]  ksys_write+0x5f/0xe0
[ 4923.957518]  do_syscall_64+0x5f/0x1a0
[ 4923.957522]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 4923.957524] RIP: 0033:0x7f4d9b6d3218
[ 4923.957526] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 45 83 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 60 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
[ 4923.957527] RSP: 002b:00007fff5c274058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 4923.957529] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007f4d9b6d3218
[ 4923.957530] RDX: 000000000000000d RSI: 0000561908dc28c0 RDI: 0000000000000001
[ 4923.957532] RBP: 0000561908dc28c0 R08: 0000561908dc28c0 R09: 000000000000000a
[ 4923.957533] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000d
[ 4923.957534] R13: 00007f4d9b7a76c0 R14: 000000000000000d R15: 00007f4d9b7a2800
[ 4923.957536] ---[ end trace ad6a1a5d61f73b3d ]---
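
The `NVRM: Attempting to remove minor device 0 with non-zero usage count!` line suggests that something still had the NVIDIA device open when the unbind was attempted. A rough sketch for listing which processes hold `/dev/nvidia*` open by walking `/proc/<pid>/fd`; the function name `nvidia_holders` and the `PROC_ROOT` override are hypothetical, and it has to run as root to see other users' processes:

```shell
#!/usr/bin/env bash
# PROC_ROOT is overridable so the function can be tried on a fake tree.
PROC_ROOT="${PROC_ROOT:-/proc}"

# Print "<pid> <device node>" for every open fd that points at /dev/nvidia*.
nvidia_holders() {
    local fd target pid
    for fd in "${PROC_ROOT}"/[0-9]*/fd/*; do
        # Unreadable or vanished entries make readlink fail; skip them.
        target=$(readlink "${fd}" 2>/dev/null) || continue
        case "${target}" in
            /dev/nvidia*)
                pid=${fd#"$PROC_ROOT"/}   # strip the proc root prefix
                pid=${pid%%/*}            # keep only the PID component
                echo "${pid} ${target}"
                ;;
        esac
    done | sort -u
}

nvidia_holders
```

Running this while the GPU is powered on, before attempting the unbind, would show whether any process is still attached to the device.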

Any ideas how I can fix that?

This is how I’m trying to rebind my dGPU in order to switch from the nvidia driver to the vfio driver:

DGPU_PCI_ADDRESS="01:00.0"

fedora@linux:~$ echo "> Retrieving and parsing dGPU IDs..."
> Retrieving and parsing dGPU IDs...
fedora@linux:~$ DGPU_IDS=$(sudo ${OPTIRUN_PREFIX}lspci -n -s "${DGPU_PCI_ADDRESS}" | grep -oP "\w+:\w+" | tail -1)
fedora@linux:~$ DGPU_VENDOR_ID=$(echo "${DGPU_IDS}" | cut -d ":" -f1)
fedora@linux:~$ DGPU_DEVICE_ID=$(echo "${DGPU_IDS}" | cut -d ":" -f2)
fedora@linux:~$ echo "> DGPU_IDS: $DGPU_IDS"
> DGPU_IDS: 10de:13d7
fedora@linux:~$ echo "> DGPU_VENDOR_ID: $DGPU_VENDOR_ID"
> DGPU_VENDOR_ID: 10de
fedora@linux:~$ echo "> DGPU_DEVICE_ID: $DGPU_DEVICE_ID"
> DGPU_DEVICE_ID: 13d7
    
echo "> Unbinding dGPU nvidia driver..."
> Unbinding dGPU nvidia driver...
sudo optirun bash -c "echo '0000:${DGPU_PCI_ADDRESS}' > '/sys/bus/pci/devices/0000:${DGPU_PCI_ADDRESS}/driver/unbind'"

# The following lines are never reached because the previous command hangs:

echo "> Binding dGPU to VFIO driver..."
sudo bash -c "echo '${DGPU_VENDOR_ID} ${DGPU_DEVICE_ID}' > '/sys/bus/pci/drivers/vfio-pci/new_id'"

(By the way, I need this in order to pass the dGPU through to a VM, and as I said, this used to work just fine.)