Infrastructure planning

Hey folks. We had some discussions on IRC/Matrix about the riscv infrastructure and what we should try to do moving forward. I thought it might be good to note the current status and the options, and get input from everyone so we can come up with the best plan. :slight_smile:

@davidlt please do correct any mistakes I make here. :wink:

Currently, the Koji hub is a machine in CA, USA. It’s got some NVMe drives, but it’s not using them very efficiently. Its / drive is also small. There’s a pool of SAS drives (~100TB), but they aren’t in any machine at the moment. David has a backup local to him, as well as a bunch of builders.

So, I think we have basically 3 phases we need to consider:

  1. Short term: (next 6 months).

I would propose we get a 2TB NVMe to use as the / drive on the current hub. David would then redo the other NVMes there for a larger /mnt/koji volume. This would take some downtime, plus syncing the data back over after the reinstall, but I would think it might keep things going OK for another 6 months? Perhaps by then we can catch up to Rawhide?
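For the "syncing the data back over" step, the rough shape would be an rsync run from wherever the copy lives back onto the rebuilt volume. Here is a minimal sketch in Python; the backup host and paths are placeholders, not the actual layout:

```python
#!/usr/bin/env python3
"""Rough sketch of the post-reinstall resync step.

The source host and path below are hypothetical placeholders; the real
backup lives wherever David keeps his copy.
"""
import subprocess
import sys

SOURCE = "backup-host:/srv/koji-backup/"  # placeholder backup location
DEST = "/mnt/koji/"                       # rebuilt volume on the hub


def resync() -> int:
    # -aHAX keeps hardlinks, ACLs and xattrs; --partial lets an
    # interrupted multi-TB transfer resume instead of starting over;
    # --info=progress2 prints overall progress rather than per-file.
    cmd = [
        "rsync", "-aHAX", "--numeric-ids",
        "--partial", "--info=progress2",
        SOURCE, DEST,
    ]
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    sys.exit(resync())
```

The downtime window is then basically the reinstall plus however long the copy takes over whatever link we have.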

  2. Medium term: (next 1-2 years)

This is the toughest one. Basically we need to keep things rolling along and keep up with Fedora mainline if at all possible. There are a few options here.

We could move to AWS in the Fedora account. This would have some nice advantages. It wouldn’t really use any of the existing hardware, however. (This would cost us nothing, as Amazon happily picks up our account currently.)

David could build a new server at his location, populate it with the SAS drives, and move the Koji hub there. This means it would be really close to a lot of the builders, but on the other hand, it might mean only David has access to fix things, etc. (This would be $s for a new server.)

We could build a new hub server and ship it to the Red Hat community cage in RDU. This would allow enterprise management, on-site people, a good uplink, etc. We could also use the drives we already have. (This would be $ for a new server, but we could probably use our discounts, etc. if we just get a Dell or the like.)

  3. Long term: (2+ years)

Once things are keeping up with Fedora and once more ‘enterprisey’ builders are available, we can look at merging into mainline. This would likely be a system-wide Fedora Change that would need to go through the process and get approved; then we’d enable things just before a mass rebuild and build it all. :slight_smile:

Anyhow, I might have misread something, missed some good options, or missed some pros and cons somewhere, so please do chime in if you have any opinions. :slight_smile:

kevin

I can’t speak to the technical merits of the medium-term choices, but I’d rank them as:

1 (best). Use AWS. They’re giving us the resources, so we might as well use them.
2. Buy a server for the Community Cage. We might still have time to squeeze in that CapEx request for the 2023 budget. Consult your financial advisor. Alternatively, there may be someone in Red Hat who could work with hardware partners to get us some free (or at least not-retail) gear as part of a partnership agreement.
3. Have David build a new server. That sounds like a recipe for putting hero work on David and I’d rather not do that.

The current Koji Hub server is:

- 2S Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz (20C/40T total)
- 128G of RAM
- 250G NVMe as the main OS drive (Samsung SSD 960 EVO 250GB)
- 3x SN200 6.40T PCIe NVMe for the main storage (/mnt/koji, database, backups [sadly])

There are another 10 x 12T drives that were supposed to be the cold storage and backup location, but we never got a server for them.

Two NVMes are in RAID1 for /mnt/koji (the precious stuff). Sadly this is 96% full right now (<300G of free space).
The last drive hosts the Koji database and backups (we use restic). The backup repository is synced to a remote location from time to time, which is an external drive (also running out of free space, but I have a NAS to replace it).
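Since running out of space on /mnt/koji is what keeps biting us, a tiny check like the sketch below could run from cron or a systemd timer and complain before we hit the wall. The 300 GiB threshold is just picked to roughly match where the volume sits today, not an agreed number:

```python
#!/usr/bin/env python3
"""Minimal free-space check for /mnt/koji.

The threshold is an assumption chosen to match the current situation,
not something we have agreed on.
"""
import shutil
import sys

MOUNT = "/mnt/koji"
MIN_FREE_GB = 300


def main() -> int:
    usage = shutil.disk_usage(MOUNT)
    free_gb = usage.free / 1024**3
    pct_used = 100 * usage.used / usage.total
    print(f"{MOUNT}: {pct_used:.1f}% used, {free_gb:.0f} GiB free")
    # Non-zero exit so a cron job or systemd timer can alert on it.
    return 1 if free_gb < MIN_FREE_GB else 0


if __name__ == "__main__":
    sys.exit(main())
```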

We went with the flash storage because it provided large 4K random mixed IOPS (75% reads, 25% writes). It is basically an IOPS monster, and that allowed us to cook a new distro repo within 2 minutes.
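For anyone wanting to reproduce that workload on candidate hardware, a fio job along these lines approximates it. The queue depth, job count, file size and runtime below are my guesses, not the numbers from our original benchmarking:

```python
#!/usr/bin/env python3
"""Sketch of a fio run approximating the 4K random 75/25 mixed workload.

iodepth, numjobs, size and runtime are assumptions, not the original
benchmark settings; target_dir must already exist on the filesystem
under test.
"""
import subprocess


def run_mixed_iops_test(target_dir: str = "/mnt/koji/fio-test") -> None:
    subprocess.run(
        [
            "fio",
            "--name=randrw-75-25",
            f"--directory={target_dir}",
            "--rw=randrw", "--rwmixread=75",  # 75% reads, 25% writes
            "--bs=4k",                        # 4K random I/O
            "--ioengine=libaio", "--direct=1",
            "--iodepth=32", "--numjobs=4",
            "--size=10G",
            "--time_based", "--runtime=60",
            "--group_reporting",
        ],
        check=True,
    )


if __name__ == "__main__":
    run_mixed_iops_test()
```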

The server is hosted in Fremont, CA, USA. The majority of the builders will be hosted in Lithuania at my place for now. Previously the majority of builders were concentrated in the USA. The server needs a new home.

I think SMART reports that we have probably moved 1PB of data from /mnt/koji (not sure how much we want to trust that). I see interfaces reporting 30+TB of data moved since reboot.

I think I could keep it going pretty much as-is. It would need rebuilding. I would most likely merge all 3 NVMes, with no redundancy, into a single pool (that would give us <20T for /mnt/koji). I am considering switching from XFS to Btrfs (mainly because snapshots look cleaner compared to the LVM approach). An alternative would be to build something locally (cheap) and let the majority of the traffic go via a switch/router.
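If we do end up on Btrfs, the snapshot workflow could be as simple as the sketch below (a read-only snapshot taken before risky maintenance, run as root). Nothing here is decided; the .snapshots directory is hypothetical:

```python
#!/usr/bin/env python3
"""Sketch of taking a read-only Btrfs snapshot of /mnt/koji.

Assumes /mnt/koji would become a Btrfs subvolume, which is only being
considered, not decided; the snapshot directory is hypothetical.
Needs to run as root.
"""
import os
import subprocess
import time

SUBVOLUME = "/mnt/koji"            # would-be Btrfs subvolume
SNAP_DIR = "/mnt/koji/.snapshots"  # hypothetical snapshot directory


def snapshot() -> str:
    os.makedirs(SNAP_DIR, exist_ok=True)
    dest = f"{SNAP_DIR}/koji-{time.strftime('%Y%m%d-%H%M%S')}"
    # -r makes the snapshot read-only, i.e. a point-in-time copy that
    # can be backed up or rolled back to.
    subprocess.run(
        ["btrfs", "subvolume", "snapshot", "-r", SUBVOLUME, dest],
        check=True,
    )
    return dest


if __name__ == "__main__":
    print("created", snapshot())
```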

Note that we do run a very minimal setup, which is not good enough for the future. We want to match a proper Fedora infra and get as close as possible.

We will catch up to F37 and Rawhide sooner rather than later, but the showstopper will be running out of /mnt/koji storage. That’s easy to do with large-scale builds.

So, it sounds like in the short term David just reorganizes things so we don’t run out of space.

For the medium term, perhaps we could trial AWS? @davidlt do you happen to know which AWS region is nearest you? I can spin up an instance and we can try it out as a secondary hub? Possibly we could even sync data to it for a hot backup?

Or would you like to pursue any of the other options? Or does anyone else have thoughts?
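To help answer the "nearest region" question above, a quick probe like the sketch below could give a rough answer. It just times a TCP connect to the public EC2 endpoints, which is a crude proxy for real latency, and the region list is my guess at plausible candidates:

```python
#!/usr/bin/env python3
"""Crude nearest-AWS-region probe.

The region list is an assumption (EU regions seem likely candidates,
plus the US for comparison); a TCP connect time to the public EC2
endpoint is only a rough proxy for real-world latency.
"""
import socket
import time

REGIONS = ["eu-north-1", "eu-central-1", "eu-west-1", "us-east-1", "us-west-1"]


def probe(region: str, attempts: int = 3) -> float:
    host = f"ec2.{region}.amazonaws.com"
    best = float("inf")
    for _ in range(attempts):
        start = time.monotonic()
        with socket.create_connection((host, 443), timeout=5):
            pass
        best = min(best, (time.monotonic() - start) * 1000)
    return best


if __name__ == "__main__":
    for region in REGIONS:
        try:
            print(f"{region:>14}: {probe(region):6.1f} ms")
        except OSError as exc:
            print(f"{region:>14}: unreachable ({exc})")
```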