Debug help needed: Fedora CoreOS OKD workers just 'stop'

eclipse-foundation · March 31, 2021, 2:42pm

Hi,

We’ve been deploying an OKD (4.6.0-0) cluster with FCOS-powered worker nodes, but we’re running into an issue where every so often a random worker node (bare-metal or KVM guest) starts reporting a skyrocketing load average and eventually becomes completely unresponsive (via SSH or the remote HW console).

There never seems to be anything in the system logs (outside of normal chatter), and when we run top on the ‘dying’ node (if we can catch it before it falls over) the system actually looks idle (based on I/O and memory). One thing we have seen is a trickle of network traffic from the affected node (based on tcpdump captures), so the system is doing ‘something’.

Any suggestions on debug ideas or kernel flags we can use at boot to try and work out what’s going on?

jaimelm · April 1, 2021, 1:20am

Hi,

Could it be this?

https://bugzilla.redhat.com/show_bug.cgi?id=1906496

Jaime

jaimelm · April 1, 2021, 1:20am

One way to help yourself and others narrow things down is to use must-gather.
https://docs.okd.io/latest/support/gathering-cluster-data.html#about-must-gather_gathering-cluster-data

If you want, join us over in #openshift-dev on the Kubernetes Slack. There are more eyes there to look things over.

Jaime

eclipse-foundation · April 1, 2021, 10:50pm

While we did run into issues with Thanos, our CI wizard found some mitigations which have prevented it from biting us.

eclipse-foundation · April 1, 2021, 10:52pm

WRT the ‘must-gather’: as the node becomes completely unresponsive, we are unable to gather many details until it’s rebooted, at which point the problem is ‘fixed’ for some random period of time.

That’s why I was asking whether there is some magic kernel parameter we could set for FCOS to help us gather more data.

But we can try to grab the must-gather next time a node starts dying.
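
One more thing that could be captured while a node is still reachable: on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, which is one way the load can climb while top looks idle. Below is a minimal sketch, not a tested tool, that prints the load average and lists D-state processes by reading /proc. It assumes Python 3 is available, most likely in a debug or toolbox container, since as far as I know FCOS does not ship Python on the host. The function names and the `proc_root` parameter are made up for illustration.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def read_loadavg(path="/proc/loadavg"):
    """Return the 1, 5 and 15 minute load averages as floats."""
    with open(path) as f:
        one, five, fifteen = f.read().split()[:3]
    return float(one), float(five), float(fifteen)


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for processes currently in state D.

    Only the main thread of each process is checked; per-thread state
    lives under /proc/<pid>/task/<tid>/stat and is not read here.
    """
    tasks = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the line was unparsable
        if state == "D":
            tasks.append((pid, comm))
    return tasks


if __name__ == "__main__":
    print("load average:", read_loadavg())
    for pid, comm in blocked_tasks():
        print(f"D-state: pid={pid} comm={comm}")
```

If the host's /proc is mounted somewhere else inside the container, pass that location as `proc_root`. Running this in a loop and appending the output to a file on persistent storage might leave something to look at after the reboot.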
