Best way to troubleshoot a bad process?

I have isolated a process (megasync) that causes issues in GNOME shell (the issue does not appear in KDE plasma, XFCE or even i3-gaps). I tested this on Ubuntu, Manjaro and Fedora GNOME. They all appear to have the same problem. When this process is running, it causes the GNOME shell to pause/hiccup for about 1 second, every 10 seconds or so. When the hiccup occurs it freezes everything.

If I stop the process (megasync) everything returns to normal. What is the best way (without the source code) to troubleshoot this issue? I reached out to MEGA.co.nz but got the boilerplate response to reboot, re-install etc. As I stated before, I tried 3 different distributions with different versions of GNOME and the issue is the same across.

My hardware:

System:    Host: thinkpadx1e Kernel: 5.6.7-300.fc32.x86_64 x86_64 bits: 64 Desktop: Gnome 3.36.1 
       Distro: Fedora release 32 (Thirty Two) 
Machine:   Type: Laptop System: LENOVO product: 20MF000DUS v: ThinkPad X1 Extreme serial: <superuser/root required> 
       Mobo: LENOVO model: 20MF000DUS v: SDK0J40697 WIN serial: <superuser/root required>     UEFI: LENOVO v: N2EET46W (1.28 )     
       date: 03/18/2020 
Battery:   ID-1: BAT0 charge: 73.2 Wh condition: 81.7/80.4 Wh (102%) 
CPU:       Topology: 6-Core model: Intel Core i7-8850H bits: 64 type: MT MCP L2 cache: 9216 KiB 
       Speed: 3743 MHz min/max: 800/4300 MHz Core speeds (MHz): 1: 3743 2: 3759 3: 3754 4: 3677 5: 3696 6: 3581 7: 3580 
       8: 3748 9: 3633 10: 3599 11: 3663 12: 3572 
Graphics:  Device-1: NVIDIA GP107M [GeForce GTX 1050 Ti Mobile] driver: nvidia v: 440.82 
       Display: x11 server: Fedora Project X.org 1.20.8 driver: nvidia resolution: 3840x2160~60Hz 
       OpenGL: renderer: GeForce GTX 1050 Ti with Max-Q Design/PCIe/SSE2 v: 4.6.0 NVIDIA 440.82 
Audio:     Device-1: Intel Cannon Lake PCH cAVS driver: snd_hda_intel 
       Device-2: NVIDIA GP107GL High Definition Audio driver: snd_hda_intel 
       Sound Server: ALSA v: k5.6.7-300.fc32.x86_64 
Network:   Device-1: Intel Wireless-AC 9560 [Jefferson Peak] driver: iwlwifi 
       IF: wlp0s20f3 state: up mac: 24:ee:9a:7b:eb:b6 
       Device-2: Intel Ethernet I219-LM driver: e1000e 
       IF: enp0s31f6 state: down mac: 48:2a:e3:4e:a0:64 
       IF-ID-1: virbr0 state: down mac: 52:54:00:4b:f2:35 
       IF-ID-2: virbr0-nic state: down mac: 52:54:00:4b:f2:35 
Drives:    Local Storage: total: 1.82 TiB used: 876.33 GiB (47.0%) 
       ID-1: /dev/nvme0n1 vendor: Samsung model: MZVLB1T0HBLR-000L7 size: 953.87 GiB 
       ID-2: /dev/nvme1n1 vendor: Samsung model: SSD 970 EVO Plus 2TB size: 1.82 TiB 
Partition: ID-1: / size: 68.40 GiB used: 17.01 GiB (24.9%) fs: ext4 dev: /dev/dm-0 
       ID-2: /boot size: 975.9 MiB used: 200.1 MiB (20.5%) fs: ext4 dev: /dev/nvme1n1p2 
       ID-3: /home size: 1.71 TiB used: 859.10 GiB (49.2%) fs: ext4 dev: /dev/dm-2 
       ID-4: swap-1 size: 15.49 GiB used: 4.2 MiB (0.0%) fs: swap dev: /dev/dm-1 
Sensors:   System Temperatures: cpu: 78.0 C mobo: 56.0 C gpu: nvidia temp: 56 C 
       Fan Speeds (RPM): cpu: 4180 
Info:  Processes: 434 Uptime: 17h 14m Memory: 30.86 GiB used: 8.31 GiB (26.9%) Shell: bash inxi: 3.0.38

Your starting point with be the logs. See if anything related to the hiccup is logged there when it does happen:

It could be an I/O related issue, but it’s hard to say without more information.

I started watchin the logs in the terminal: journalctl -f and looked for anything from gnome-shell or megasync and saw nothing. This is the issue I have, how do I collect more info? I ran strace and not sure how to interpret the results. So now I will investigate what the syscall futex is…

strace -C /usr/bin/megasync --debug

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 46.93    0.056627          20      2780       597 futex
  9.19    0.011092           9      1126           poll
  7.12    0.008592           3      2720           read
  4.90    0.005918           4      1292       198 openat
  3.79    0.004577           2      1748      1239 access
  3.22    0.003891           3      1142           mmap
  3.20    0.003866           2      1765           lseek
  2.54    0.003060           2      1183           close
  1.99    0.002403           2       943           write
  1.71    0.002065           2       837         1 fstat
  1.71    0.002061           2       713       530 recvmsg
  1.46    0.001767           3       529        32 stat
  1.44    0.001739           2       672           writev
  1.15    0.001391          12       108           munmap
  1.13    0.001364          34        40           socket
  1.13    0.001363           3       366        50 statx
  1.08    0.001308           2       501           mremap
  0.61    0.000738           2       322           mprotect
  0.55    0.000668          55        12           socketpair
  0.49    0.000587           4       142           fstatfs
  0.45    0.000541           2       207           lstat
  0.41    0.000492          35        14           fdatasync
  0.40    0.000478           2       197       104 readlink
1 Like

I am still not seeing anything here in the logs and I tried vmstat, still not seeing anything.

$ vmstat 4 10 -S m
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0  22243    208   5767    0    0    69    55  331  318  4  1 95  0  0
 0  0      0  22219    208   5767    0    0     0    13 2746 3163  5  1 94  0  0
 0  0      0  22242    208   5767    0    0     0    27 2502 4698  3  2 95  0  0
 1  0      0  22241    208   5767    0    0     0    76 2392 3772  4  2 95  0  0
 1  0      0  22243    208   5767    0    0     0     0 2643 3146  5  1 94  0  0
 1  0      0  22243    208   5767    0    0     0     8 2355 4064  3  2 95  0  0
 1  0      0  22241    208   5767    0    0     0   112 2985 3388  5  1 94  0  0
 1  0      0  22241    208   5767    0    0     0    85 3048 4982  4  2 94  0  0
 1  0      0  22241    208   5767    0    0     0    33 3215 5421  4  2 94  0  0
 1  0      0  22241    208   5767    0    0     0     6 3008 4023  5  1 94  0  0

and:

 $ ps -ef | wc -l
401

The slowdown occurs on line 4 and 8 of vmstat.

Edit: I collected these from the wrong host, I was logged into my ThinkCentre!

Another thing I noticed, this does not happen on my ThinkCentre. Both running Fedora 32 with GNOME, but the system with the issue is using the NVIDIA GPU, whereas the other is using a Radeon GPU. So it only seems to appear in GNOME with an NVIDIA card (see my initial entry).