Fedora CoreOS and the Single-Node Use Case

Since the single-node use case doesn’t get a lot of press these days compared to the kind of cluster deployment where a container doesn’t care which, or how many, nodes it’s running on, I thought I would describe how we want to use FCOS and containers, and then ask a couple of questions that came up as I read through the issue tracker discussions.

We currently run about 7,000 geographically dispersed (mostly bare metal) nodes. Future plans call for significantly increasing both the number of nodes and the degree of dispersion. With some exceptions, each node is configured individually and can be thought of as a distinct single-node installation. Some of this is due to their geographical location and network topology, and some to functional needs. Our interest in FCOS stems from characteristics like:

  • the immutable core;
  • fast, automated updates;
  • iPXE + Ignition as a way to do initial configuration;
  • the provision for fallback on boot failure (a big deal when nodes are 2,000 km away);
  • the option to seamlessly integrate node-local applications with computationally-intensive functions via something like a Kubernetes cluster set up in parallel;
  • the kind of coordination and management protocols and tools that go into wide deployments;
  • and finally, just because it’s Fedora.

We want to use containers in order to have a degree of separation and independence between the node OS and the applications running on a node, to separate applications running on the same node, and to enable more efficient development of applications without (too much) concern for host system library compatibility, etc. All of the classic, pre-cluster motivations for containers.

So my first question is whether people think this even makes sense, either on its own or within the context of your plans and vision for Fedora CoreOS? If not, it would be good to know that sooner rather than later!

Then, just looking at, for example, the network management issue, there seems to be some disagreement about support for the single-node case, for multi-NIC hosts, or for the ability to change the configuration of the host after the Ignition script has run (without requiring a reboot or re-initialization). Although I’m sure there are many use cases where such things don’t matter much, lack of support for any of these would be fatal for us, so it would be good to know what people are thinking/planning.

I thought I would post this and see what people say. If FCOS isn’t the right way to go, that would be valuable knowledge, but if it is a good choice, then maybe knowing that there are use cases like this will be helpful going forward?

Thanks very much for any feedback!

I see no reason we wouldn’t support use cases like yours. For example, you can see some discussion of reboot management where we touch on single node vs clusters.

Thanks! That’s encouraging!

Bare metal support should always be a core feature.

Yup, the single-node model is one of our primary use cases. You didn’t say how you want node reboots to be scheduled, which is important for upgrades in single-node deployments, but as @walters said, Fedora CoreOS will have some built-in tooling to help you.

IMO the network management discussion is primarily about toolchain selection. We want to allow configuring networking (including multiple NICs) from an Ignition config and then not touching it thereafter. However, while we lean toward the immutable infrastructure model (reconfigure a node by reprovisioning it), we’re not going to prevent you from managing your nodes a different way.
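
For illustration, on Container Linux today that would look roughly like the snippet below: a Container Linux Config (transpiled to Ignition with ct) that lays down static systemd-networkd units for two NICs at provisioning time. The interface names and addresses are placeholders, and the toolchain Fedora CoreOS settles on (networkd, NetworkManager, or something else) may well look different:

    # Container Linux Config snippet: static configs for two interfaces,
    # written once by Ignition on first boot and not touched thereafter.
    networkd:
      units:
        - name: 10-wan.network
          contents: |
            [Match]
            Name=eth0

            [Network]
            Address=203.0.113.10/24
            Gateway=203.0.113.1
        - name: 20-lan.network
          contents: |
            [Match]
            Name=eth1

            [Network]
            Address=192.168.10.1/24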

You didn’t say how you want node reboots to be scheduled, which is important for upgrades in single-node deployments

There’s some discussion about this right now, but I think the best solution in our case would be to have a per-node maintenance window so that people who depend on a particular node or group of nodes can choose the best timing and not get an unpleasant surprise in the middle of some important work. Unlike what seems like the typical cluster situation, where nodes are anonymous and work moves among them, in our usage, each node fulfils a unique role, so an outage counts, and the ability to schedule that for an off-peak time on a per-node basis would be greatly appreciated. We have to be mindful of the ‘A’ in the C-I-A triad.

I like the idea of immutable infrastructure but have a dumb question. When you talk about reconfiguring a node by reprovisioning it, what would be the mechanism? Given bare metal nodes, my mental model has been that the very first time a system boots, the Ignition configuration runs, and then that’s it. So… to reprovision means that you would wipe the disk and re-install an OS image + Ignition configuration? Or is there some intermediate method where, say, an already running system downloads a new Ignition configuration and that reprovisions the system?

Thanks!

That’s right. In my mind it is basically (a sketch of the re-install step follows the list):

  • tweak ignition configuration
  • re-install
  • ignition will run on first boot with updated configuration
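
For the re-install step, on Container Linux today that would be something like the command below, run from a PXE or rescue environment. The device, channel, and config path are placeholders, and the exact installer tooling for Fedora CoreOS is still being worked out:

    # wipe the target disk and reinstall, pointing first boot at the updated Ignition config
    sudo coreos-install -d /dev/sda -C stable -i updated-config.ign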

Yup, locksmith can do this today.
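
For the maintenance-window part, the knobs on Container Linux live in /etc/coreos/update.conf; something roughly like the following (the strategy and times here are example values) tells locksmithd to only reboot for updates inside a weekly window:

    # example /etc/coreos/update.conf read by locksmithd
    REBOOT_STRATEGY=reboot
    LOCKSMITHD_REBOOT_WINDOW_START=Thu 04:00
    LOCKSMITHD_REBOOT_WINDOW_LENGTH=1h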

We’ve talked about having a “factory reset” feature which would allow re-running Ignition, but nothing has been nailed down. Otherwise, yes, it’s a full reinstall. That can be done with a network boot setup, either PXE-boot-to-install or by PXE-booting directly into a diskless production system. (I’m assuming the latter is not viable for your application, though.)
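
As a sketch of the network-boot route, the Container Linux PXE images today are booted with an iPXE script along these lines (URLs and the config path are placeholders, and the kernel arguments will differ for Fedora CoreOS):

    #!ipxe
    # load the kernel + initramfs over HTTP and point first boot at a per-host Ignition config
    kernel http://boot.example.com/coreos_production_pxe.vmlinuz initrd=coreos_production_pxe_image.cpio.gz coreos.first_boot=1 coreos.config.url=http://boot.example.com/ignition/host01.ign
    initrd http://boot.example.com/coreos_production_pxe_image.cpio.gz
    boot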

I noticed that option in locksmith and it looked good. Talking with people more after the holidays, there is also interest in a kind of selective reboot option, where the people monitoring the nodes would be able to say: update N systems in a first go, then, if all goes well, the next 2xN, and so on. We’re still trying to work it out. There is considerable uneasiness about two-week updates and the frequent reboots they entail, because of the fear that multiple systems might fail to boot, not fall back, and require a site visit.

Actually, can I ask whether that’s been an issue in CoreOS or Atomic in the past? I realize that everything depends on configurations and other factors, but it would be helpful to have an idea of whether that’s been a problem.

By the way I’m going to try to catch the IRC meeting, just to get a sense of what is going on. I’m still in learning mode.

We’ve talked about having a “factory reset” feature which would allow re-running Ignition, but nothing has been nailed down.

I’ve wondered about something like that, too, simply because it would be great to have the configuration of a host encapsulated in a series of Ignition files, which could then be kept in something like a per-host git repo and tracked over time. But I imagine each “reset” would still require a reboot, which is kind of a heavy price for small changes. That said, I’m still trying to work out which functions can be containerized and so handled without requiring host reconfiguration. Maybe in the end there just won’t be very much, but for example I’m thinking about things like host Netfilter rules and interface configurations. I’m not sure what needs to be done at the host configuration level.

We want to have a locksmith mode where locksmith asks an external service for permission to reboot. That service should be able to implement something like this.

Here are some ways a node can fail after reboot:

  1. GRUB crashloops rather than booting.
  2. The kernel fails to boot at all.
  3. The kernel boots but quickly crashes.
  4. The kernel starts but fails to find the boot disk.
  5. The system boots but fails to connect to the network.
  6. The system boots but fails to start the desired services, such as dockerd.
  7. The system boots but the kernel eventually crashes.

On Container Linux, 2-4 produce an automatic rollback, 1 and 5 require manual intervention, and 6-7 require intervention which can possibly be done remotely. With Fedora CoreOS we’re hoping to extend automatic rollback to 5-6 as well, with the user able to specify which systemd units must start successfully for a boot to be considered successful.

Container Linux has occasionally had bugs that lead to node failures on upgrade, mostly kernel regressions (example) but occasionally bugs in other components such as udev or networkd. Those bugs have generally been configuration-dependent. We do our best to prevent regressions from reaching the stable channel, but it does happen, particularly with relatively uncommon software or hardware setups.

The best thing you can do to prevent regressions from affecting you – and I can’t emphasize this enough – is to run some of your machines on the testing stream and report regressions to us. Those nodes should be readily accessible if manual intervention is needed, of course, but should otherwise have the same hardware, configuration, and workload as your other nodes. Our goal is to stage changes through the testing stream whenever possible (with exceptions for major security fixes and the like), so there should be no surprises when those changes reach stable.

Thanks very much for such a detailed reply. That’s really, really helpful. Since our hosts are often single-node installations in far-flung locations, anything that can help maintain connectivity even in the event of problems will be of great help! I’m pushing for a serious investment in automated testing, trying to wring as many problems as we can out of the ‘next’ and ‘testing’ channels before it’s too late. With that in mind, I have a couple of follow-up questions:

If a bug is reported in the ‘testing’ release, does that mean that the stable release will be delayed until it is resolved? That seems like the prudent move, depending, I guess, on the nature of the bug. A misspelling in the MOTD is one thing, a dockerd failure mode is another!

Also, are there documents somewhere that describe the test setup that CoreOS and Atomic use? We’re doing budget requests right now for the next year, and I’ve been told to ask for whatever I need this time because it will be much more difficult next year. It would be great if I knew exactly what to specify, so that we don’t have to re-invent wheels that you guys have already proven in practice.

Thanks very much for your help!

We haven’t discussed any sort of detailed policy yet. For major regressions, I expect we’d either delay the stable release, or release without the individual change causing the regression.

The Container Linux test framework is kola, part of mantle. Testing is a mix of local QEMU and cloud; we test on bare metal via Packet, but we don’t currently have a bare-metal test lab. We intend to test Fedora CoreOS with kola as well.
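
For reference, a local run looks roughly like this (flag names as in the current mantle tree; the image path is a placeholder):

    # list the tests kola knows about
    kola list
    # run the test suite against a local image under QEMU
    kola run --platform=qemu --qemu-image=./coreos_production_qemu_image.img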

Replicating our test setup will probably not provide a lot of value to you, because the concern is regressions that our tests don’t catch: in drivers for hardware we don’t have, or with network or container runtime configurations we don’t use, or with workloads we don’t have. So I’d recommend that you build a test setup that gives you confidence in your workloads.

Thanks! And you are right, of course. What I was interested in was the framework and the infrastructure you use, not the actual tests, although… if we end up layering additional functionality on top of the OS it might be good to run an identical set of tests as well. Right now our testing is very manual and takes far too long. I might even characterize the mindset as “fear of upgrades,” so moving people toward a more CI/CD sort of approach is going to be a challenge, and getting a good testing program in place is crucial.

What does “re-install” mean in this case?

We’ve got almost the same situation: thousands of servers in single-node setups. Most of them have two network interfaces, and one of the two has to be configured for the local network, usually at the remote location. That is, we ship a bare metal machine preinstalled with the host OS, and at the remote location only the local network configuration needs to be done. Sometimes this has to happen without a network connection, since the network at the site will only be installed later (a few days or weeks after our servers arrive), but we do know the network configuration in advance (just not far enough in advance to ship the machines preconfigured).
So for us it would be very important to be able to reconfigure the network without re-installing the system.
Is that possible?

@lfarkas It will remain possible to reconfigure the system using configuration management tools or manual SSH. It’s not the model we’d generally recommend, but Fedora CoreOS won’t prevent it.
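
For example, if the host ends up shipping NetworkManager (one of the candidates in the network management discussion), configuring the second NIC on-site over SSH or the console could be as simple as the following; the interface name and addressing are placeholders:

    # create and activate a static connection for the second interface
    sudo nmcli connection add type ethernet ifname eth1 con-name lan \
        ipv4.method manual ipv4.addresses 192.168.10.1/24
    sudo nmcli connection up lan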