We’ve been deploying an OKD (4.6.0-0) cluster with FCOS-powered worker nodes, but we’re running into an issue where, every so often, a random worker node (bare-metal or KVM guest) starts reporting a skyrocketing load average and eventually becomes completely unresponsive (to SSH or the remote HW console).
There never seems to be anything logged in the system logs (beyond normal chatter), and when we run top on the ‘dying’ node (if we can catch it before it falls over), the system actually looks idle (in terms of I/O and memory). One thing we have seen is a trickle of network traffic from the affected node (based on tcpdump captures), so the system is doing ‘something’.
Any suggestions on debugging ideas, or kernel flags we can set at boot, to try and work out what’s going on?
One way to help yourself and others narrow things down is to use must-gather.
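For reference, a typical invocation looks something like this (assuming the `oc` client is logged in with cluster-admin; the `--dest-dir` path is just an example):

```shell
# Collect the default must-gather bundle into a local directory
# (requires cluster-admin and a working API connection).
oc adm must-gather --dest-dir=./must-gather-output

# If a suspect node is still reachable, its kubelet journal can
# also be pulled individually (node name is a placeholder):
oc adm node-logs <node-name> -u kubelet > kubelet.log
```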
If you want, join us over in #openshift-dev on the Kubernetes Slack. There are more eyes there to look things over.
While we did run into issues with Thanos, our CI wizard found some mitigations which have prevented it from biting us.
WRT the ‘must-gather’: since the node becomes completely unresponsive, we’re unable to gather many details until it’s rebooted, at which point the problem is ‘fixed’ for some random period of time.
Hence my question about whether there’s some magic kernel parameter we could set on FCOS to help us gather more data.
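For what it’s worth, the sort of thing we’ve been considering (untested on our setup, so treat this as a sketch) is turning on the kernel’s lockup/hung-task detectors via boot args, applied per-node with rpm-ostree:

```shell
# Sketch: kernel args to surface soft lockups / hung tasks on the console
# (applied with rpm-ostree on FCOS; takes effect after a reboot).
sudo rpm-ostree kargs --append='nmi_watchdog=1' \
                      --append='softlockup_panic=1' \
                      --append='hung_task_panic=1'

# Also enable magic SysRq so a task/stack dump can be forced from the
# HW console even when ssh is dead (e.g. Alt-SysRq-t / -l / -w):
sudo sysctl -w kernel.sysrq=1
```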
But we can try to grab a must-gather the next time a node starts dying.