I have a very odd and frustrating issue with a Fedora 38 server: the network goes down for a second or two exactly every 30 minutes and 10 seconds.
I have checked the logs and there is absolutely nothing reported; dmesg shows nothing. ifconfig looks normal, and netstat and ethtool show nothing unusual.
I ran a tcpdump capture at a time I knew the server would go down, and all it shows is that my test pings fail to get a response. There’s nothing else untoward in the capture.
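For anyone wanting to reproduce the timing measurement, here is a minimal sketch of a ping logger that records when replies stop and resume. It is not what was actually run here; the function names (`probe`, `find_outages`, `monitor`) are made up for illustration, and it assumes the Linux iputils `ping` with its `-c` and `-W` options.

```python
import subprocess
import time
from datetime import datetime


def probe(host: str) -> bool:
    """Send one ICMP echo request; return True if a reply arrived in time."""
    # -c 1: send a single packet, -W 1: wait at most 1 s for the reply
    # (options as in Linux iputils ping; other platforms differ).
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_outages(samples):
    """Collapse (timestamp, ok) samples into (first_fail, last_fail) spans."""
    outages = []
    start = None
    last_fail = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts  # first failed probe of a new outage
            last_fail = ts
        elif start is not None:
            outages.append((start, last_fail))
            start = None
    if start is not None:
        # Still failing when sampling stopped.
        outages.append((start, last_fail))
    return outages


def monitor(host: str, duration_s: float, interval_s: float = 0.5):
    """Probe `host` repeatedly for `duration_s` seconds, then return outage spans."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((datetime.now(), probe(host)))
        time.sleep(interval_s)
    return find_outages(samples)
```

Something like `monitor("192.0.2.1", 3700)` (placeholder address) would cover one full 30 minute 10 second cycle. Running it both from the server toward the gateway and from another host toward the server helps show whether both directions fail at the same moment.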
All servers are on the same switch (Cisco SG300-52) and I’m pretty sure it’s the actual server at fault and not the switch as there’s nothing logged on the switch to indicate that it closed the port at any point.
So, I’m absolutely stumped and have exhausted pretty much every testing procedure I can think of.
What exactly does “goes down” mean?
What log corresponds to this state change?
When does the 30 minutes and 10 seconds start? At boot?
What if you reboot at the 15 min point?
What if you turn off and back on the interface at the 15 min point?
By “goes down” in this instance I mean that it cannot reach anything, in or out. The interface still appears to be up, just that it cannot reach anything, just for a second or two.
When does the 30 minutes and 10 seconds start?
That’s a really good question and one I should know the answer to. I totally should have tried that. I will schedule a reboot, probably for tomorrow morning, and let you know the outcome.
Switch port has been changed but not the cable. I will change that when I reboot.
Check cable routing and look for anything running on a 30 min 10 s timer.
The first IBM PC my lab bought crashed every day at 4PM. I checked the line voltage – it was dropping to 90V. The outlet we used (older building) was mistakenly connected to the ventilation system where some large motor on a timer was set to switch on at 4PM.
You checked crontab, /etc/cron and systemctl list-timers? (I am sure you know this)
Is there any monitoring software polling for state info from the server or the switch?
systemctl list-timers is a new one for me, thanks for that. Will use that in the future. But alas, nothing in there either.
However, I don’t think it is the server after all.
Rebooted at 08:08 (British Summer Time). I forgot that this server, being an HP, takes a few minutes to come up, so that was a little close to the time it was due to go offline. I rebooted again at 08:18 and it came back at 08:22. Here are the times it went offline:
06:12:24
06:42:34
07:12:44
07:42:54
Reboot at 08:08, back online at 08:12
08:13:04
Reboot again at 08:18, back online at 08:22:07
08:43:14
09:13:24
From that, the schedule does not appear to have changed across either reboot, so it’s looking unlikely to be the server.
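The arithmetic behind that conclusion can be checked with a short sketch. It uses only the outage times listed above, assumes they all fall on the same day, and extrapolates the next outage on the assumption that the period stays constant.

```python
from datetime import datetime, timedelta

# Outage times reported above (same day, local time).
outages = [
    "06:12:24", "06:42:34", "07:12:44", "07:42:54",
    "08:13:04",  # after the 08:08 reboot
    "08:43:14",  # after the 08:18 reboot
    "09:13:24",
]

times = [datetime.strptime(t, "%H:%M:%S") for t in outages]

# Gap in seconds between each consecutive pair of outages.
intervals = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
print(intervals)

# If the period holds, the next outage is one interval after the last one.
period = timedelta(seconds=intervals[0])
next_outage = (times[-1] + period).strftime("%H:%M:%S")
print(next_outage)
```

Every gap comes out at 1810 seconds (30 minutes 10 seconds), including the gaps that span the two reboots. A timer that started at boot would have shifted the schedule, so the trigger is most likely outside the server.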
External inputs are power, network, and RF. I agree that network seems the most likely to have timers, but it may be helpful to rule out the other two.
Power Quality Monitoring lists the prime suspects for power problems. (I worked in an oceanographic institute where sensitive lab gear was often used at sea, with generators and lots of RF signals from radar and two-way radio, so power glitches and RFI were all too common.) If there is some timed RF signal, it may be that the problem server has a defect in its EMI shielding or filters.
Can you swap out the server’s power supply and/or network interface?
Can you record power timelines from the system board using an external recorder? I have seen electronics fail in a timed mode due to a component that went out of spec when warm, but the time to fail was measured from power on.
Is the problem server’s location more exposed to external RF signals? Are you near any powerful transmitters (military base, shipping, airport)?