We’ve been deploying an OKD (4.6.0-0) cluster with FCOS-powered worker nodes, but we’re running into an issue where, every so often, a random worker node (bare-metal or KVM guest) starts reporting a skyrocketing load average and eventually becomes completely unresponsive (via SSH or the remote HW console).
There never seems to be anything logged in the system logs (beyond normal chatter), and when we run top on the ‘dying’ node (if we can catch it before it falls over) the system actually looks idle (based on I/O and memory). One thing we have noticed is a trickle of network traffic from the affected node (based on tcpdump captures), so the system is doing ‘something’.
Any suggestions for debug ideas, or kernel flags we could set at boot, to try to work out what’s going on?
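For what it’s worth, the direction we were thinking of is something like the following (a sketch, assuming the standard rpm-ostree tooling on FCOS; the specific flag values are guesses on our part, not something we’ve validated):

```shell
# Enable the kernel's soft-lockup detector, the NMI watchdog, and magic
# SysRq, so a wedged node panics (and can be caught by kdump) instead of
# hanging silently. Applied via rpm-ostree kernel arguments on a worker:
sudo rpm-ostree kargs \
  --append=softlockup_panic=1 \
  --append=nmi_watchdog=1 \
  --append=sysrq_always_enabled=1
sudo systemctl reboot

# Optionally, have the node reboot itself 60s after a panic instead of
# sitting dead until someone power-cycles it:
sudo rpm-ostree kargs --append=panic=60
```

With `sysrq_always_enabled=1` we could at least try Alt+SysRq dumps from the HW console next time one wedges, assuming the kernel is still servicing interrupts.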
WRT the ‘must-gather’: since the node becomes completely unresponsive, we’re unable to gather much detail until it’s rebooted, at which point the problem is ‘fixed’ for some random period of time.
Hence my question about whether there’s some magic kernel parameter we could set on FCOS to help us gather more data.
But we can try to grab the must-gather the next time a node starts dying.
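One other thing we may try in the meantime is streaming kernel messages off the box with netconsole, so we still get logs after SSH dies (a sketch; the interface name, IPs, and MAC below are placeholders for our log-collector setup, not real values):

```shell
# Load netconsole so kernel printk output is sent over UDP to a remote
# listener -- this keeps working through a hung userspace as long as the
# kernel and NIC are still alive. Format is:
#   src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
sudo modprobe netconsole \
  netconsole=6665@192.0.2.10/eth0,6666@192.0.2.99/00:11:22:33:44:55

# On the collector host (192.0.2.99 in this sketch), just listen:
nc -u -l 6666
```

The same `netconsole=` string can also go on the kernel command line (e.g. via `rpm-ostree kargs`) so it’s active from early boot rather than only after modprobe.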