Fedora/RISCV: Should we disable LTO (GCC; already disabled for Clang)?

I have been thinking about disabling LTO (globally) for riscv64 for some weeks/months now. All Fedora/RISCV builds currently happen natively on SiFive HiFive Unmatched boards. The SoC (FU740) in the Unmatched does not have strong single-threaded performance, and no other board on the market (available now or in the near future) offers a significant improvement there. LTO hits us hard because it’s mainly single-threaded, and it typically requires more memory. Even on a 64-core SOPHGO SG2042 with T-HEAD C920 cores we would only get a 4-5x speedup when building very large packages (e.g. GCC). We continue to be bottlenecked by the LTO link stage, which in some cases can take hours.

Most likely this stops being a problem once we get newer hardware in 1-2 years. SOPHGO recently announced the SG2380 (2.5GHz, 16 SiFive P670 OoO cores), to be available on a board next year. There is also Ventana with server SoCs/chiplets, but it’s unknown when those would be available or within reach of the majority of folks. There is also the SG2044 (an updated SG2042), scheduled for next year, which might improve things. Most of this hardware will most likely arrive towards the end of next year (if it doesn’t slip into 2025).
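The "64 cores, but only 4-5x" observation above is roughly what Amdahl's law predicts when a sizeable part of the build is serial. Below is a minimal sketch of that arithmetic. It rests on a loud assumption that does not hold exactly here: it treats the 4-5x figure as if it were measured against a single core of the same type, whereas the post compares different boards and cores. The function names are made up for illustration.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def implied_serial_fraction(speedup: float, cores: int) -> float:
    """Invert Amdahl's law: which serial fraction yields this speedup?"""
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


# Assumption: the 4-5x figure from the post, treated as a same-core speedup.
CORES = 64
for observed in (4.0, 5.0):
    s = implied_serial_fraction(observed, CORES)
    print(f"{observed:.0f}x on {CORES} cores -> serial fraction ~ {s:.0%}")
```

Under that assumption, something like a fifth to a quarter of the build time behaves as serial work, which fits the picture of a mostly single-threaded LTO link stage dominating large packages.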

We (RISCV) do not have ld.gold linker support. To my knowledge no company is actively working on it, or ever has. PLCT listed it in their roadmap for 2023, but judging by their weekly minutes there has been no activity on this item. I also looked at the RISE ( https://riseproject.dev/ ) priorities for 2023 2H and 2024 1H, and at their JIRA tickets. It is not on any list. Not having ld.gold means we can’t support LTO with the LLVM/Clang + binutils (ld.bfd) toolchain setup today (and may well never).

The linkers that are actively being worked on for RISCV are ld.bfd, lld, and mold.

Disabling LTO means we could produce more builds per day, which is highly beneficial for us (it’s hard for us to keep up with upstream Koji).

Does anyone have strong objections to this?

Thanks for the input.

I did my own timings of a few package builds and disabling LTO makes a significant difference. Here are the times for 4 package builds on various systems, with LTO on and off.

Times are in seconds, larger numbers are worse.

qemu-system-riscv64 16 vCPUs, 16 GB on AMD Ryzen 9 7950X host

               binutils     openssl      python3.12   mingw-gcc
  (LTO)        4493         3052        14502        12428
  (no LTO)     3267         1351         6353        (failed)

VisionFive 2

               binutils     openssl      python3.12   mingw-gcc
  (LTO)        7202         8823                     (crashed in LTO step)
  (no LTO)     3274         2059         11627
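To make the tables above easier to compare, here is a small sketch that computes how much longer each LTO build took than its non-LTO counterpart. The numbers are copied from the tables; entries without a usable time (failed, crashed, or blank) are stored as `None`, and the dictionary layout and function name are just illustrative choices.

```python
from typing import Optional

# Build times in seconds as (LTO, no LTO), copied from the tables above.
# None marks an entry with no usable time (failed, crashed, or blank).
TIMES = {
    "qemu-system-riscv64": {
        "binutils": (4493, 3267),
        "openssl": (3052, 1351),
        "python3.12": (14502, 6353),
        "mingw-gcc": (12428, None),
    },
    "VisionFive 2": {
        "binutils": (7202, 3274),
        "openssl": (8823, 2059),
        "python3.12": (None, 11627),
    },
}


def lto_slowdown(lto: Optional[int], no_lto: Optional[int]) -> Optional[float]:
    """Return LTO time divided by non-LTO time, or None if either is missing."""
    if lto is None or no_lto is None:
        return None
    return lto / no_lto


for system, packages in TIMES.items():
    print(system)
    for package, (lto, no_lto) in packages.items():
        ratio = lto_slowdown(lto, no_lto)
        shown = "n/a" if ratio is None else f"{ratio:.2f}x"
        print(f"  {package:<12} LTO / no LTO = {shown}")
```

On these numbers, the LTO builds that completed took roughly 1.4x to 4.3x as long as the matching non-LTO builds.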

One thing you may notice is that some LTO builds are simply too large for SBCs like the VF2. Also, because non-LTO builds are no longer done routinely, some of them now fail (as mingw-gcc did under qemu).

My feeling is that this is where we are right now, but in future as we get RISC-V parts with higher single thread performance we’ll be able to turn LTO on again.

Yeah, it seems unfortunate to turn off LTO… means that things will be out of sync with mainline and also not have any of the advantages of LTO, but yeah… if it needs to be off for now to get things more under control, then so be it. I would expect we would want to enable it again as soon as possible though.

The gains from LTO are usually rather small… I think that for most C programs they would be in the single-digit percent range. It’s probably different for huge C++ codebases, but I don’t think many people are running those on RISC-V anyway. So I think it’s totally reasonable to disable LTO.

StarFive VisionFive 2 I/O bandwidth seems OK. Parallelism/threads are definitely working at compile time, and it feels impressively comparable to 8-core x86_64/arm64 performance.

Yeah the linker hit is definitely felt when building gtk4-rs/slint/bottom examples, but they do link eventually after a significantly long wait.

Is all the linking currently done sequentially in a single thread?

You found a bottleneck! It’s an opportunity to think about and change how things get done in LTO. There must be a way of doing this stuff in parallel while still keeping the different section sizes in the destination binaries intact.

ELF-wise, I am aware of the text and data sections. Sure, text could be done by one thread and data by another, but in theory the individual text sections themselves could be shared among many threads working in parallel, plunking their findings into channels/queues.

Maybe all this is done already I don’t know. Just a brain fart and Happy New Year Mr. Abdurachmanov and Mr. Jones and the rest of the KOJI/FEDORA/RISCV TEAM! Best of health and happiness for all of you and your loved ones. You guys are SUPER-HEROES :smile: