I have a Dell OptiPlex running Fedora Server 38. It mostly works fine, except that at random moments the server just stops working. Nothing responds anymore, and if I plug in a display it is blank. The hard drive indicator also no longer flashes.
The only way to get it working again is to hold the power button and restart.
However, I can’t find any error logs whenever this happens. It’s like the OS doesn’t even know anything happened.
I know this is really vague, but I literally can’t find anything that might point me to what is going on.
Edit: it does seem to only crash at night, but maybe that is just a coincidence.
Sometimes a system seems dead when the display fails, but the kernel is still running.
Keep a log of the times when the server stopped working and when you shut it down with the power button.
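If a handwritten log gets tedious, something like the sketch below could keep it in a CSV file and count the hangs by hour and by weekday, which would help confirm or rule out a pattern such as "only at night". The file layout, the event names ("hang", "power_off"), and the function names are all made up for illustration. Run it on a machine other than the server, since the server can't record its own hang.

```python
import csv
from collections import Counter
from datetime import datetime


def record_event(path, kind, when):
    """Append one event (e.g. "hang" or "power_off") with its timestamp."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([when.isoformat(timespec="seconds"), kind])


def tally(path, kind="hang"):
    """Count events of one kind by hour of day and by weekday (0 = Monday)."""
    by_hour, by_weekday = Counter(), Counter()
    with open(path, newline="") as f:
        for stamp, event in csv.reader(f):
            if event == kind:
                when = datetime.fromisoformat(stamp)
                by_hour[when.hour] += 1
                by_weekday[when.weekday()] += 1
    return by_hour, by_weekday
```

For example, call record_event("events.csv", "hang", datetime.now()) when you notice the server is down, and again with "power_off" when you hold the power button. After a few weeks, tally("events.csv") shows whether the hangs cluster at particular hours or days.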
Do you have another system you can use to connect to the server with SSH?
As for “error logs”, sudo journalctl should have every message the kernel produces. If the kernel has died, the times of the last journal entries for each boot should be earlier than the times when you shut the system down with the power button.
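To make that comparison less error-prone, here is a rough sketch that pulls the last journal entry of the previous boot and measures how long the journal was silent before the power-off time you wrote down. It assumes the journal is persistent (so earlier boots are still available) and that you run it with enough privileges to read the system journal. The function names are just for illustration. The journalctl options used (-b, -n, -o json, --no-pager) and the __REALTIME_TIMESTAMP field are standard.

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone

EPOCH = datetime.fromtimestamp(0, tz=timezone.utc)


def entry_time(json_line):
    """Convert one line of `journalctl -o json` output to a UTC datetime."""
    entry = json.loads(json_line)
    # __REALTIME_TIMESTAMP is microseconds since the Unix epoch, as a string.
    return EPOCH + timedelta(microseconds=int(entry["__REALTIME_TIMESTAMP"]))


def silent_gap(last_entry, power_off):
    """How long the journal was silent before the forced power-off."""
    return power_off - last_entry


def last_entry_of_boot(offset=-1):
    """Time of the final journal entry of a boot (-1 = previous boot)."""
    result = subprocess.run(
        ["journalctl", "-b", str(offset), "-n", "1", "-o", "json", "--no-pager"],
        check=True, capture_output=True, text=True,
    )
    return entry_time(result.stdout.strip().splitlines()[-1])
```

After rebooting from a hang, silent_gap(last_entry_of_boot(-1), power_off), with power_off as a timezone-aware datetime taken from your own notes, gives the gap. A gap of hours suggests the kernel stopped long before you pressed the button. A gap of seconds suggests the system was still logging and only looked dead.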
Is the server on a network that could be accessed by “bad actors” (e.g., the internet or a campus-wide internal network)? What IPs had connections when the system stopped working? (This could be in application-specific connection logs.)
Unfortunately, the computer doesn’t respond to SSH requests in this state. I believe pings don’t even work.
The server is running behind our own router, and SSH is forwarded on a random port and does not accept passwords. So I don’t think people are getting into the system.
There have been cases where systems were configured (for things like vendor updates) to use an outside network (dialup, wifi, or cellular) and leaked packets that would otherwise be blocked by the main router. In my lab we had our own block of IP addresses and could block connections from the rest of the campus.
On an enterprise-level managed network, scans run by the IT group have sometimes caused systems to fall over. The first instance I encountered was NCD X-terminals running out of memory when hit by a scan. A mission-critical hardware controller that ran on Win2K used to fall over randomly. I came in 7 days a week for a year, logging every failure, and discovered that it never happened on a Sunday. I never did discover the cause, but at least I didn’t have to go to work on Sundays.