I still have this problem with both the closed and open source versions of the NVIDIA driver.
I tried to debug the problem, but my kernel is locked down because I use Secure Boot, which I cannot turn off.
How do I debug a kernel memory leak on a locked-down kernel?
Why not just turn off Secure Boot in the BIOS while you debug?
I cannot turn off Secure Boot in the BIOS: there is no such option on my Gigabyte Aorus notebook. I installed my own key to sign the nvidia driver built by akmod.
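For reference, this is roughly the flow I followed, using the rpmfusion-documented defaults; the key path and tool names below are the Fedora defaults and may differ on other setups:

sudo kmodgenca -a                                             # create the akmods CA if it does not exist yet
sudo mokutil --import /etc/pki/akmods/certs/public_key.der    # set a one-time enrolment password
# reboot, confirm the key in the MOK manager, then rebuild and load the module
sudo akmods --force
sudo dracut --force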
If you feel the memory leak is in the open source version of the driver, you might consider removing the open source version and installing the proprietary version. I have never heard of a memory leak there.
I tried switching from the open source version to the proprietary one and back. The kernel memory leak persists with both versions of the driver.
I want to debug the source of the leak in the open source version of the driver. It looks like a bug in a CUDA_register() function for mmap-ed memory, or something like that: the function pins a memory region and the kernel driver never unpins it afterwards, not even after the main process exits.
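One way to confirm that pins outlive the process, without touching the driver, is to watch the FOLL_PIN counters in /proc/vmstat before and after the CUDA workload. These counters are not exposed on every kernel build (they may require a debug-enabled kernel), so treat this as a sketch:

grep foll_pin /proc/vmstat    # snapshot before starting the workload
# ... run the CUDA program and wait for it to exit ...
grep foll_pin /proc/vmstat    # if nr_foll_pin_acquired keeps growing while nr_foll_pin_released
                              # lags behind even after exit, pages are pinned and never unpinned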
If the memory leak occurs regardless of which driver is used, then it would seem that something other than the graphics driver may be the cause.
Both drivers share a common developer, NVIDIA, and they may well have made the same mistake in both.
True, but that does not explain why no one else seems to be reporting the memory leak. I think it is probably something else that is in use at the same time.
I certainly have never seen problems with the proprietary driver version and I use nvidia on all my systems.
Possible reasons:
a) Very few people run CUDA on a notebook with Fedora and a GeForce 3080 Ti.
b) I’m under attack again. (Russians hack us here in Ukraine very often.)
c) Something is broken in recent kernels.
d) Something in my hardware was broken by my vendor (Gigabyte).
I noticed that large models run by ollama cause the following bug in the nvidia driver, after which the memory block remains pinned in system memory:
(kmod-nvidia-6.11.8-200.fc40.x86_64-565.57.01-2.fc40.x86_64)
[ 237.625642] ------------[ cut here ]------------
[ 237.625645] WARNING: CPU: 3 PID: 4417 at mm/page_alloc.c:4677 __alloc_pages_noprof+0x2ca/0x350
[ 237.625650] Modules linked in: rfcomm snd_seq_dummy snd_hrtimer nvidia_uvm(O) nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr bnep sunrpc binfmt_misc vfat fat snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic soundwire_intel soundwire_cadence snd_sof_intel_hda_common nvidia_drm(O) snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp nvidia_modeset(O) snd_sof snd_sof_utils snd_soc_hdac_hda iwlmvm snd_soc_acpi_intel_match soundwire_generic_allocation snd_soc_acpi soundwire_bus snd_soc_avs snd_hda_codec_realtek intel_uncore_frequency intel_uncore_frequency_common snd_soc_hda_codec snd_hda_codec_generic snd_hda_ext_core x86_pkg_temp_thermal mac80211 snd_hda_scodec_component snd_soc_core intel_powerclamp snd_hda_codec_hdmi snd_compress ac97_bus snd_pcm_dmaengine coretemp snd_hda_intel nvidia(O) snd_intel_dspcfg kvm_intel
[ 237.625680] snd_intel_sdw_acpi libarc4 snd_hda_codec snd_hda_core mei_hdcp snd_hwdep mei_pxp kvm iTCO_wdt uvcvideo spi_nor uvc videobuf2_vmalloc btusb mtd iwlwifi spd5118 videobuf2_memops snd_seq btrtl intel_pmc_bxt btintel videobuf2_v4l2 iTCO_vendor_support rapl videobuf2_common intel_rapl_msr btbcm snd_seq_device btmtk processor_thermal_device_pci videodev processor_thermal_device intel_cstate snd_pcm asus_wmi processor_thermal_wt_hint mc bluetooth cfg80211 r8169 processor_thermal_rfim wmi_bmof pcspkr intel_uncore sparse_keymap mei_me snd_timer processor_thermal_rapl platform_profile realtek intel_rapl_common cdc_ether mei spi_intel_pci i2c_i801 snd spi_intel idma64 processor_thermal_wt_req thunderbolt i2c_smbus usbnet soundcore processor_thermal_power_floor rfkill processor_thermal_mbox igen6_edac mii intel_pmc_core intel_vsec int3400_thermal pmt_telemetry acpi_thermal_rel pmt_class acpi_tad acpi_pad int3403_thermal joydev int340x_thermal_zone loop nfnetlink dm_crypt uas usb_storage xe drm_ttm_helper gpu_sched
[ 237.625713] drm_suballoc_helper drm_gpuvm drm_exec i915 i2c_algo_bit drm_buddy ttm crct10dif_pclmul nvme crc32_pclmul crc32c_intel polyval_clmulni drm_display_helper polyval_generic nvme_core hid_multitouch ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 cec nvme_auth i2c_hid_acpi i2c_hid video wmi pinctrl_tigerlake serio_raw ip6_tables ip_tables fuse
[ 237.625727] CPU: 3 UID: 969 PID: 4417 Comm: ollama_llama_se Tainted: G O 6.11.8-200.fc40.x86_64 #1
[ 237.625729] Tainted: [O]=OOT_MODULE
[ 237.625730] Hardware name: GIGABYTE AORUS 15 YE5/AORUS 15 YE5, BIOS FB09 12/27/2022
[ 237.625731] RIP: 0010:__alloc_pages_noprof+0x2ca/0x350
[ 237.625733] Code: 24 08 e9 4a fe ff ff e8 24 f5 f9 ff e9 88 fe ff ff 83 fe 0a 0f 86 b3 fd ff ff 80 3d 77 ed 3c 02 00 75 09 c6 05 6e ed 3c 02 01 <0f> 0b 45 31 ff e9 e5 fe ff ff f7 c2 00 00 80 00 75 4d f7 c2 00 00
[ 237.625734] RSP: 0000:ffffa22e816d79c8 EFLAGS: 00010246
[ 237.625736] RAX: 0000000000000000 RBX: 0000000000040cc0 RCX: 0000000000000000
[ 237.625736] RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000040cc0
[ 237.625737] RBP: 000000000000000c R08: 000000000003df00 R09: 0000000000000050
[ 237.625738] R10: ffffa22e816d7af8 R11: 0000000000000001 R12: 0000000000000000
[ 237.625738] R13: 00000000ffffffff R14: 0000000000000cc0 R15: ffffffff823cb5fd
[ 237.625739] FS: 00007f36d57fe000(0000) GS:ffff94589f580000(0000) knlGS:0000000000000000
[ 237.625740] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 237.625741] CR2: 00007f36b1157018 CR3: 000000023d434000 CR4: 0000000000f50ef0
[ 237.625742] PKRU: 55555554
[ 237.625742] Call Trace:
[ 237.625743] <TASK>
[ 237.625744] ? __alloc_pages_noprof+0x2ca/0x350
[ 237.625745] ? __warn.cold+0x8e/0xe8
[ 237.625747] ? __alloc_pages_noprof+0x2ca/0x350
[ 237.625751] ? report_bug+0xff/0x140
[ 237.625752] ? handle_bug+0x58/0x90
[ 237.625754] ? exc_invalid_op+0x17/0x70
[ 237.625755] ? asm_exc_invalid_op+0x1a/0x20
[ 237.625756] ? __gup_longterm_locked+0x5ad/0xa00
[ 237.625758] ? __alloc_pages_noprof+0x2ca/0x350
[ 237.625759] ? __gup_longterm_locked+0x5ad/0xa00
[ 237.625760] ___kmalloc_large_node+0x67/0x100
[ 237.625762] __kmalloc_large_node_noprof+0x21/0xb0
[ 237.625763] ? __get_user_pages+0x108/0x7e0
[ 237.625765] __kmalloc_noprof+0x2e0/0x490
[ 237.625767] ? __gup_longterm_locked+0x5ad/0xa00
[ 237.625768] __gup_longterm_locked+0x5ad/0xa00
[ 237.625770] pin_user_pages+0x6e/0xb0
[ 237.625772] os_lock_user_pages+0xbc/0x1b0 [nvidia]
[ 237.625890] RmCreateOsDescriptor+0x6b/0x110 [nvidia]
[ 237.626062] RmIoctl+0xb8d/0xd60 [nvidia]
[ 237.626229] ? portSyncSpinlockAcquire+0x1d/0x50 [nvidia]
[ 237.626334] rm_ioctl+0x66/0x4f0 [nvidia]
[ 237.626504] ? __check_object_size+0x21c/0x230
[ 237.626507] nvidia_unlocked_ioctl+0x53b/0x8d0 [nvidia]
[ 237.626572] __x64_sys_ioctl+0x94/0xd0
[ 237.626574] do_syscall_64+0x82/0x160
[ 237.626576] ? do_user_addr_fault+0x55a/0x7b0
[ 237.626579] ? exc_page_fault+0x7e/0x180
[ 237.626581] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 237.626584] RIP: 0033:0x7f3735125f2d
[ 237.626598] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
[ 237.626599] RSP: 002b:00007f36d57d4190 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 237.626601] RAX: ffffffffffffffda RBX: 00007f36d57d4290 RCX: 00007f3735125f2d
[ 237.626602] RDX: 00007f36d57d4290 RSI: 00000000c0384627 RDI: 0000000000000012
[ 237.626602] RBP: 00007f36d57d41e0 R08: 00007f36d57d4290 R09: 00007f36d57d42b8
[ 237.626603] R10: 00007f36d57d4400 R11: 0000000000000246 R12: 00000000c0384627
[ 237.626604] R13: 0000000000000012 R14: 00007f36d57d42b8 R15: 00007f36d57d4200
[ 237.626605] </TASK>
[ 237.626605] ---[ end trace 0000000000000000 ]---
[ 237.628083] Cannot map memory with base addr 0x7f34c2000000 and size of 0x1d1f98 pages
I use the open source version of the nvidia driver, compiled locally using akmod.
How can I find the root cause of the bug and fix it in the open source version of the nvidia driver?
I compiled and signed my own build of the Fedora kernel from the f40 git branch, unlocked and with kmemleak enabled. However, kmemleak finds no leaks. Maybe I am using it the wrong way, or the nvidia driver is not covered by kmemleak (I have no idea).
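As far as I understand, kmemleak only tracks kmalloc/vmalloc-style object allocations and reports objects it can no longer find references to, so pages that are pinned via pin_user_pages and never unpinned will usually not show up as leaks at all. For completeness, the usual manual scan flow is:

mount | grep -q debugfs || sudo mount -t debugfs none /sys/kernel/debug
echo scan | sudo tee /sys/kernel/debug/kmemleak    # trigger a scan after the workload has exited
sudo cat /sys/kernel/debug/kmemleak                # list any orphaned objects found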
I have an unlocked kernel now, so I can use BPF. I can also try to install an unlocked debug kernel.
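With BPF available, a quick sketch is to count calls into the pinning path while reproducing the leak. The os_lock_user_pages probe name comes from the stack trace above; os_unlock_user_pages is my assumption about the matching release path in the nvidia module, and all of this assumes the symbols are visible in /proc/kallsyms and not inlined:

sudo bpftrace -e '
  kprobe:os_lock_user_pages          { @lock[comm]   = count(); }
  kprobe:os_unlock_user_pages        { @unlock[comm] = count(); }
  kprobe:unpin_user_pages_dirty_lock { @unpin[comm]  = count(); }
'
# run the ollama workload, let it exit, then press Ctrl-C to print the counts;
# a large @lock count with no matching @unlock/@unpin points at the descriptor cleanup path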
You will need to report it to NVIDIA to investigate, I expect.
I have the same issue with Google DeepVariant.
All processes have ended, but the RAM usage is still very high.
# smem -wp
Area                    Used     Cache    Noncache
firmware/hardware       0.00%    0.00%    0.00%
kernel image            0.00%    0.00%    0.00%
kernel dynamic memory   93.85%   5.43%    88.42%
userspace memory        2.98%    1.01%    1.97%
free memory             3.17%    3.17%    0.00%
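To narrow down where that "kernel dynamic memory" actually sits, a rough before/after check of /proc/meminfo may help: if Slab and VmallocUsed stay roughly flat while MemAvailable drops and never recovers after the processes exit, the missing memory is more likely pinned or leaked pages than kernel objects.

grep -E 'MemFree|MemAvailable|Slab|SUnreclaim|VmallocUsed|Unevictable|Mlocked' /proc/meminfo
# run the workload, wait for all processes to exit, then repeat and compare the values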
I have these NVIDIA versions; I think I may be using a beta driver. I will try different versions later today or soon.
akmod-nvidia.x86_64 3:565.57.01-1.fc41 <unknown>
kmod-nvidia-6.11.10-300.fc41.x86_64.x86_64 3:565.57.01-1.fc41 @commandline
kmod-nvidia-6.11.8-300.fc41.x86_64.x86_64 3:565.57.01-1.fc41 @commandline
I wasn’t able to get any other driver versions to work. I’m using the version from rpmfusion (now updated to “565.57.01-2.fc41”). I’ve also had similar memory problems using TensorFlow interactively through Python.
Phoronix reported that NVIDIA released the stable 565.74(?) build.
I wonder if, once that is packaged in rpmfusion, you will have a fix?