Investigating igb network driver failure to keep speed negotiation - coming up short

Problem statement:
I need help debugging what I think may be a kernel issue with the igb module. On 3 different interfaces, across two different pieces of NIC hardware (both Intel), I get the same issue: the negotiated speed drops to 100 Mbps, and it only goes back to 1 Gbps when I reset the interface.

Fedora: 33 and 34 prerelease (both affected).

A bit of background, because I know everyone will focus on a bad cat6 cable. This is the 4th cable I’m using. It’s the 2nd switch I’m using, and it’s the 2nd NIC I’m using. The only things that are “the same” are Fedora, the motherboard and the CPU/memory.

Here’s an example from dmesg of what I see:

[   71.030408] igb 0000:06:00.0 enp6s0f0: igb: enp6s0f0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
[  141.767864] igb 0000:06:00.0 enp6s0f0: igb: enp6s0f0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[  233.180428] igb 0000:06:00.0 enp6s0f0: igb: enp6s0f0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
[ 1452.904227] igb 0000:06:00.0 enp6s0f0: igb: enp6s0f0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[ 1593.918465] igb 0000:06:00.0 enp6s0f0: igb: enp6s0f0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX

It tends to settle on 100Mbps, but it can change, particularly under load, which of course is not optimal.
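For anyone who wants to quantify the flapping rather than eyeball dmesg, here is a minimal Python sketch that pulls the "NIC Link is Up" lines out of dmesg text and lists each change in negotiated speed. The function names are made up for illustration, and the regular expression only assumes the igb message format shown in the excerpt above.

```python
import re

# Matches igb "NIC Link is Up" lines in the format shown in the dmesg excerpt.
LINK_UP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+igb\s+\S+\s+(?P<iface>\S+):.*?"
    r"NIC Link is Up (?P<speed>\d+) Mbps (?P<duplex>\w+) Duplex"
)


def link_events(lines):
    """Return (timestamp_seconds, interface, speed_mbps) for each link-up line."""
    events = []
    for line in lines:
        m = LINK_UP.search(line)
        if m:
            events.append((float(m["ts"]), m["iface"], int(m["speed"])))
    return events


def speed_changes(events):
    """Return (timestamp, old_speed, new_speed) for each change in speed."""
    changes = []
    for (_, _, prev), (ts, _, cur) in zip(events, events[1:]):
        if cur != prev:
            changes.append((ts, prev, cur))
    return changes


# The five lines quoted in the post above.
SAMPLE = """\
[   71.030408] igb 0000:06:00.0 enp6s0f0: igb: enp6s0f0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
[  141.767864] igb 0000:06:00.0 enp6s0f0: igb: enp6s0f0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[  233.180428] igb 0000:06:00.0 enp6s0f0: igb: enp6s0f0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
[ 1452.904227] igb 0000:06:00.0 enp6s0f0: igb: enp6s0f0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[ 1593.918465] igb 0000:06:00.0 enp6s0f0: igb: enp6s0f0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
""".splitlines()

if __name__ == "__main__":
    for ts, old, new in speed_changes(link_events(SAMPLE)):
        print(f"{ts:10.3f}s  {old} -> {new} Mbps")
```

Feeding it the full output of dmesg instead of SAMPLE gives a compact timeline that can be lined up against other logs.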

What I see happening in ethtool is that the “Advertised link modes” list is reduced so that it no longer includes 1Gbps:

[peter@boss ~]$ sudo ethtool enp6s0f0
Settings for enp6s0f0:
	Supported ports: [ TP ]
	Supported link modes:   10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	                        1000baseT/Full
	Supported pause frame use: Symmetric
	Supports auto-negotiation: Yes
	Supported FEC modes: Not reported
	Advertised link modes:  10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: Yes
	Advertised FEC modes: Not reported
	Speed: 100Mb/s
	Duplex: Full
	Auto-negotiation: on
	Port: Twisted Pair
	PHYAD: 1
	Transceiver: internal
	MDI-X: off (auto)
	Supports Wake-on: pumbg
	Wake-on: d
        Current message level: 0x00007fff (32767)
                               drv probe link timer ifdown ifup rx_err tx_err tx_queued intr tx_done rx_status pktdata hw wol
	Link detected: yes

But after executing “nmcli c up ” I get these advertised link modes:

    Advertised link modes:  1000baseT/Full
    Advertised pause frame use: Symmetric
    Advertised auto-negotiation: Yes
    Advertised FEC modes: Not reported

This stays put for a while and then changes back to 100Mbps and the first output. Using ethtool -r does not have this effect - only the nmcli c up seems to have a chance of changing the negotiated link.
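To watch for the moment the advertisement changes, the check can be scripted. Below is a small Python sketch that extracts the "Advertised link modes" list from ethtool output and reports whether 1000baseT/Full is still in it. It only parses text in the layout shown above; the function name is an assumption of mine, and how you collect the output (for example by running ethtool periodically) is left open.

```python
def advertised_modes(ethtool_output):
    """Return the list of advertised link modes from `ethtool <iface>` text."""
    modes = []
    collecting = False
    for raw in ethtool_output.splitlines():
        line = raw.strip()
        if line.startswith("Advertised link modes:"):
            value = line.split(":", 1)[1].strip()
            if value == "Not reported":
                return []
            collecting = True
            modes.extend(value.split())
        elif collecting:
            # Continuation lines carry only mode names; any other field
            # ("Advertised pause frame use: ...") contains a colon.
            if not line or ":" in line:
                break
            modes.extend(line.split())
    return modes


# The two states quoted in the post above.
BEFORE = """\
Advertised link modes:  10baseT/Half 10baseT/Full
                        100baseT/Half 100baseT/Full
Advertised pause frame use: Symmetric
"""

AFTER = """\
Advertised link modes:  1000baseT/Full
Advertised pause frame use: Symmetric
"""

if __name__ == "__main__":
    for label, text in (("before", BEFORE), ("after nmcli c up", AFTER)):
        modes = advertised_modes(text)
        print(label, "-> gigabit advertised:", "1000baseT/Full" in modes)
```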

Hardware-wise (from lspci -k):

06:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
Subsystem: Intel Corporation Device a02f
Kernel driver in use: igb
Kernel modules: igb

I tried to add a modprobe option for igb, “debug=16”, but it never takes effect. In /etc/modprobe.d/net-igb.conf I have:

options igb debug=16

Which according to modinfo should turn on full debugging. However, the module is never loaded with parameters:

ls /sys/module/igb
coresize  drivers  holders  initsize  initstate  notes  refcnt  sections  taint  uevent

(note the missing “parameters” directory).
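A note of caution on reading that directory listing: in general, the kernel only creates entries under /sys/module/<name>/parameters for parameters that the module declares with sysfs permissions, so a missing directory does not by itself prove the option was ignored. Whether that is what is happening with igb here is an assumption I have not verified. The Python sketch below simply reports what a module exposes; the function name and the sysfs_root argument are made up for illustration.

```python
import os


def exported_params(module, sysfs_root="/sys/module"):
    """Return {name: value} for a module's sysfs parameters.

    Returns None when the module has no parameters directory at all, which
    can mean either "no parameters" or "parameters not exported to sysfs".
    """
    params_dir = os.path.join(sysfs_root, module, "parameters")
    if not os.path.isdir(params_dir):
        return None
    values = {}
    for name in sorted(os.listdir(params_dir)):
        try:
            with open(os.path.join(params_dir, name)) as fh:
                values[name] = fh.read().strip()
        except OSError:
            values[name] = None  # not readable by this user
    return values


if __name__ == "__main__":
    params = exported_params("igb")
    if params is None:
        print("igb exposes no parameters directory in sysfs")
    else:
        for name, value in params.items():
            print(f"{name} = {value}")
```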

So I’m running out of ideas on how to debug this. How can I tell what causes the negotiation to change? I’ve used two different NICs and the built-in one on the motherboard (also igb), and they all give the same result. Different cat6 cables, different switches. Same result. I may have an old e1000 card somewhere. That’s about my last resort - just avoiding igb.

I had an old cheap server once that would tend to go down while under load (i.e. at the worst possible time). I think the problem was that too many things were sharing the same interrupt line and it would start missing interrupts when the scsi card, nic and video card were all running hard. Maybe check out /proc/interrupts and see if anything interesting stands out. See if you get different results if you move the card to a different slot or move interrupts to different CPUs with irqbalance.

If you suspect the kernel, you should be able to install and try older ones. But I would think there are enough igb users that such a severe problem wouldn’t go unnoticed for long.

It’s a bit of a long shot, but that’s all I got. Good luck.

I’ve had this issue for a while and already tried the F34 preview to see if a newer kernel would resolve anything, so I don’t think it’s a matter of just going back to a slightly older kernel version.

What I’m really stuck on is being unable to get debug information to see whether this is an issue the kernel module is reporting, or whether it is getting bad hardware signals because of a bad “bios” or worse. I don’t like feeling around in the dark like this with no clue as to what’s going on.

Would you know why the debug=16 from the modprobe setting isn’t taking effect?

My only guess there is that the igb module is being loaded from the initramfs, rather than after the real rootfs is mounted. If so, and your /etc/modprobe.d/igb file isn’t being included in the initramfs, then the settings you’ve supplied wouldn’t be applied.

Oh! I just noticed that you didn’t name your config file with a .conf suffix. Sometimes the parsers are written to only include files with a certain suffix. You might want to try renaming your config file to igb.conf.

As a workaround, you might try passing parameters directly to modprobe.

/etc/modprobe.d/net-igb.conf is the name of the file. So I’m obviously not understanding your comment here. Are you talking about a different file?

Sorry, I just misread something somewhere. My mistake. (I think I had glanced at ls /sys/module/igb and crossed that with /etc/modprobe.d/net-igb.conf)

If it’s been around for a while, then it might be related to the meltdown and spectre patches. You could try adding mitigations=off to your kernel command line. But, of course, that has huge security implications.

Well, not THAT long. It started around a month ago, perhaps 6 weeks. It took me a while to trace my bad connectivity on web conferences and elsewhere back to this issue. After messing a bit with the current setup I decided to use a dedicated NIC instead of the built-in one. After all the tests, over 4 different ethernet chips (all Intel, all using igb), I was pretty sure it wasn’t the hardware. Trying different cables and finally replacing the switch - nothing has worked over the weeks. My biggest stumbling block is not understanding why I cannot get more debug information other than ‘hey I’ve connected at this speed’.

One of my attempts was to go to F34 preview - that has caused other issues and I sorta regret this big step but that’s what you get from going on a branch that isn’t done.

Thanks for your perspective. In a way the fact that it wasn’t a “hey, you forgot to do X” is good to know. But in another way that’s what I was hoping to get out of this. Thanks again for trying.

General update - I’ve begun to see erratic USB/hardware behavior. Even in the system “bios” settings it at times receives random signals from the USB keyboard and mouse, and things move like crazy. On some boots, the keyboard/mouse (both USB) will not respond until I move the plug. Since I now see this outside Fedora, I think something “else” is wrong here. The system is not that old and has been rock steady with Fedora until this started. I’ve ordered a new BIOS (it will not flash it - that requires Windows) and I’m hopeful this will help mitigate things. fwupdmgr somehow has trouble getting updates to the board. It DID at one time, but the installed versions are now years behind the current vendor-supported version - so I wonder if this issue has crept in slowly due to something else.

I’ll update here if a BIOS update is all that it takes to get back fully operational. In the meantime, good ideas on debugging the physical link on the wire are much appreciated.

Just FYI, you might find this short article interesting as it describes how missing interrupts can cause hardware devices to run slowly: Improving lost and spurious IRQ handling [LWN.net]

Again, sorry that I misquoted you earlier. Good luck with your troubleshooting and do let us know what the problem ends up being. It is an interesting problem.

It looks like the debug symbols have been moved to a separate file at compile time:

$ objdump -g igb.ko 

igb.ko:     file format elf64-x86-64

Contents of the .gnu_debuglink section (loaded from igb.ko):

  Separate debug info file: igb.ko.debug
  CRC value: 0x1b883d10

I wonder if the igb.ko.debug file would need to be placed under /lib/modules/5.10.21-200.fc33.x86_64/kernel/drivers/net/ethernet/intel/igb for debug=16 to work?

I’m a few days late here, but I just found this thread through ddg, and I’m having nearly the exact same problem.
I’ve tried at least 5 cables (4 of which I know for sure work with gigabit), 2 switches, and multiple reboots. Hell, I can get 1gbit transfer speed between two of my server machines on my network (through the same switch!), but my workstation is for some reason limited to around 100mbit when talking to either of them.

I have a close to identical ethtool output:

Settings for enx00e04c784e6c:
Supported ports: [ TP MII ]
Supported link modes:   10baseT/Half 10baseT/Full 
                        100baseT/Half 100baseT/Full 
                        1000baseT/Half 1000baseT/Full 
Supported pause frame use: No
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes:  Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 100Mb/s
Duplex: Half
Port: MII
PHYAD: 32
Transceiver: internal
Auto-negotiation: off
Supports Wake-on: pumbg
Wake-on: g
Current message level: 0x00007fff (32767)
                       drv probe link timer ifdown ifup rx_err tx_err tx_queued intr tx_done rx_status pktdata hw wol
Link detected: yes

There’s only one difference between us: I’m on debian.
My uname

5.10.0-0.bpo.3-amd64 #1 SMP Debian 5.10.13-1~bpo10+1 (2021-02-11) x86_64 GNU/Linux

I think I’ve also narrowed this down to igb, but tbh I’m not well versed in compiling the kernel.

Here’s an interesting dmesg snippet. The top is using my thunderbolt 3 dock, and the bottom is using a usb gigabit ethernet adapter.

[168974.228970] igb 0000:0a:00.0: added PHC on eth0
[168974.228973] igb 0000:0a:00.0: Intel(R) Gigabit Ethernet Network Connection
[168974.228975] igb 0000:0a:00.0: eth0: (PCIe:2.5Gb/s:Width x1) 64:4b:f0:xx:xx:xx
[168974.229115] igb 0000:0a:00.0: eth0: PBA No: 000300-000
[168974.229117] igb 0000:0a:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx queue(s)
[168974.231273] igb 0000:0a:00.0 ens1: renamed from eth0

[168977.273378] igb 0000:0a:00.0 ens1: igb: ens1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[168977.274032] IPv6: ADDRCONF(NETDEV_CHANGE): ens1: link becomes ready

[168981.332999] igb 0000:0a:00.0 ens1: igb: ens1 NIC Link is Up 100 Mbps Half Duplex, Flow Control: None
[168981.333003] igb 0000:0a:00.0: EEE Disabled: unsupported at half duplex. Re-enable using ethtool when at full duplex.

[185480.118689] usb 2-1: new SuperSpeed Gen 1 USB device number 6 using xhci_hcd
[185480.139551] usb 2-1: New USB device found, idVendor=0bda, idProduct=8153, bcdDevice=30.00
[185480.139561] usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=6
[185480.139566] usb 2-1: Product: USB 10/100/1000 LAN
[185480.139569] usb 2-1: Manufacturer: Realtek
[185480.139573] usb 2-1: SerialNumber: 000001
[185480.274816] usb 2-1: reset SuperSpeed Gen 1 USB device number 6 using xhci_hcd
[185480.300153] r8152 2-1:1.0: firmware: failed to load rtl_nic/rtl8153a-4.fw (-2)
[185480.300162] r8152 2-1:1.0: Direct firmware load for rtl_nic/rtl8153a-4.fw failed with error -2
[185480.300169] r8152 2-1:1.0: unable to load firmware patch rtl_nic/rtl8153a-4.fw (-2)
[185480.331528] r8152 2-1:1.0 eth0: v1.11.11
[185480.371011] r8152 2-1:1.0 enx00e04c784e6c: renamed from eth0
[185483.241542] IPv6: ADDRCONF(NETDEV_CHANGE): enx00e04c784e6c: link becomes ready
[185483.241909] r8152 2-1:1.0 enx00e04c784e6c: carrier on
[185483.275527] r8152 2-1:1.0 enx00e04c784e6c: carrier off
[185487.145930] r8152 2-1:1.0 enx00e04c784e6c: carrier on

I’ve talked to some network engineer friends of mine, and they have absolutely no idea what the problem is, and I’m in a similar boat. Let me know if you happen upon a solution.

https://www.dell.com/support/kbdoc/en-us/000134483/resolving-issues-with-energy-efficient-ethernet-eee-or-green-ethernet

This article explains that both ends need to support EEE, otherwise the negotiation about speed never ends. So if you can switch off EEE, you might have peace.

If I see “half duplex” anywhere it’s definitely a cable error. I see full duplex but just different advertised speeds. Besides

ethtool --show-eee enp6s0f0
netlink error: Operation not supported

Right, but I’ve tried 5 different cables, 4 of which I’ve confirmed myself work at 1gbit (and all of which are cat 6).

What do you mean disable EEE? Would that have a measurable effect on my laptop battery life?

I may have found a root cause and a workaround. This issue seems related to 1026359 – Unstable link speed with e1000e module

The TL;DR version: The tuned profile in use includes “powersave” features that force the renegotiation, and in power-save mode it seems the NIC changes its advertised abilities and the speed gets lower.

Fix: Change the tuned profile to a non-powersave mode (I switched to “desktop” from “desktop-powersave”) or add a custom profile and include a:

[net]
devices=eth0,eth1 etc.

where eth0,eth1 are interfaces you want to be treated with powersave - the ones left out aren’t touched.
Still testing but so far I haven’t seen the speed flip in 30 minutes.
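As a rough illustration of the custom-profile route, the Python sketch below renders a tuned.conf that inherits from the power-saving profile and restricts the [net] section to a chosen list of interfaces. Treat all of it as assumptions on my part: the profile name desktop-no-net-powersave is hypothetical, the parent name is simply the one mentioned above, and you should check that the [net] section name matches the one used in the parent profile on your system. tuned reads custom profiles from /etc/tuned/<profile-name>/tuned.conf.

```python
import configparser
import io


def render_profile(parent, powersave_devices):
    """Render tuned.conf text that limits [net] tuning to the given devices."""
    cfg = configparser.ConfigParser()
    cfg["main"] = {"include": parent}
    # Only the interfaces listed here are handled by the net plugin;
    # per the post above, interfaces left out are not touched.
    cfg["net"] = {"devices": ",".join(powersave_devices)}
    buf = io.StringIO()
    cfg.write(buf, space_around_delimiters=False)
    return buf.getvalue()


if __name__ == "__main__":
    # enp6s0f0 is deliberately left out so that its speed is not changed.
    print(render_profile("desktop-powersave", ["enp6s0f1"]), end="")
```

The output would be saved as /etc/tuned/desktop-no-net-powersave/tuned.conf and then selected with tuned-adm profile desktop-no-net-powersave.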

To expand on this workaround - the tuned.log makes it clear that tuned was the culprit …

# grep enp6s0f0 /var/log/tuned/tuned.log
2021-03-27 12:40:12,387 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-27 12:41:12,616 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-27 18:02:18,933 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-27 18:05:49,200 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-27 19:46:41,238 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-27 19:49:21,490 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-27 20:02:21,922 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-27 20:03:22,133 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-27 20:04:02,331 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-27 20:05:02,552 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-27 20:54:33,659 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-27 20:55:33,859 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-27 21:16:14,431 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-27 21:17:14,651 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-27 21:20:24,907 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-27 21:22:45,156 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-27 21:23:56,894 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-28 16:05:55,898 INFO     tuned.plugins.base: instance net: assigning devices enp6s0f0, enp6s0f1
2021-03-28 16:06:46,283 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-28 16:08:28,362 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-28 16:10:08,642 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-28 16:17:48,957 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-28 16:19:39,192 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-28 16:23:29,453 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-28 16:24:29,672 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-28 16:30:09,967 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-28 16:32:50,219 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-28 16:36:30,484 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-28 16:37:50,710 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-28 20:57:25,370 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-28 20:58:35,591 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-28 20:59:25,810 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-28 21:00:46,034 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-28 21:35:46,262 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-29 10:32:19,193 INFO     tuned.plugins.base: instance net: assigning devices enp6s0f1, enp9s0, enp6s0f0
2021-03-29 10:33:10,000 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-29 10:37:51,802 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-29 10:38:52,023 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-29 10:45:42,332 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-29 10:46:42,551 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-29 10:55:52,906 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-29 10:58:23,153 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-29 11:35:17,016 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
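If you want to line this log up against the dmesg link messages, a short Python sketch like the one below can pull out the "setting ..." events for one interface and total up how long it was held at 100Mbps. The function names are made up for illustration, and the regular expression only assumes the tuned.log line format shown above. In the excerpt the "max speed" periods tend to be short and the "100Mbps" periods long, which would be consistent with tuned lowering the speed whenever the link looks idle, though that reading is my assumption.

```python
import re
from datetime import datetime

# Matches the plugin_net "setting ..." lines in the tuned.log format above.
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\s+INFO\s+"
    r"tuned\.plugins\.plugin_net: (?P<iface>\S+): "
    r"setting (?P<what>max speed|100Mbps)"
)


def speed_settings(lines, iface):
    """Return (datetime, 'max speed' | '100Mbps') events for one interface."""
    events = []
    for line in lines:
        m = LINE.match(line)
        if m and m["iface"] == iface:
            ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
            events.append((ts, m["what"]))
    return events


def seconds_throttled(events):
    """Total seconds between each '100Mbps' event and the next 'max speed'."""
    total = 0.0
    start = None
    for ts, what in events:
        if what == "100Mbps":
            if start is None:
                start = ts
        elif start is not None:
            total += (ts - start).total_seconds()
            start = None
    return total


# The first five log lines quoted above.
SAMPLE = """\
2021-03-27 12:40:12,387 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-27 12:41:12,616 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-27 18:02:18,933 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
2021-03-27 18:05:49,200 INFO     tuned.plugins.plugin_net: enp6s0f0: setting 100Mbps
2021-03-27 19:46:41,238 INFO     tuned.plugins.plugin_net: enp6s0f0: setting max speed
""".splitlines()

if __name__ == "__main__":
    events = speed_settings(SAMPLE, "enp6s0f0")
    print(len(events), "events,", seconds_throttled(events), "seconds at 100Mbps")
```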