Expand usage of Profile-Guided Optimization (PGO) and LLVM BOLT across Fedora packages

Hi!

I’d like to discuss with the community the idea of adopting more aggressive compiler optimizations in Fedora packages: Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO), for which the best-known tool today is LLVM BOLT.

According to many benchmarks, PGO and PLO bring substantial performance gains - I try to collect as many of these results as possible in my GitHub repo, awesome-pgo.

The value of such a change is software that is faster by default for Fedora users, and consequently a better experience with the OS. Other distros like CachyOS and Clear Linux can be a good reference for how PGO is integrated into package recipes.

I want to discuss this topic before creating a dozen issues in the bug tracker like “Please enable PGO + BOLT for project X” (I already made one for the Julia package in Fedora). Maybe there is a better way of pushing such changes into the distro than creating one issue per package. I also found initiatives like the @fedora-llvm-team/llvm-pgo-optimized Copr, but it’s limited to LLVM packages.

This was originally proposed in Bugzilla (only the LLVM BOLT part) (link), but it was suggested that I discuss such a change with the wider Fedora community. That’s why I created this topic.

Thank you.

yes, please don’t do that. :slight_smile:

I think it’s good to discuss this up front. You may want to cross-post this
to the devel mailing list ( devel@lists.fedoraproject.org ), as many
developers follow discussions there.

Once the up-front discussion has happened, the next step would be to propose a
change: Changes policy :: Fedora Docs
Basically, you write up what you want to do, why, what advantages it
brings, etc. You also make clear which part of the work you will
do and what work it adds for others.

Then the change is discussed, adjusted, discussed again, and finally submitted
to the Fedora Engineering Steering Committee (FESCo) to vote on.

My questions: Is this all just LLVM? A very large part of Fedora uses
GCC and not LLVM; would you just not make changes to those packages?
What’s the cost of doing this? Slower compile times?

What ‘profiles’ are used?

Good luck with your idea/change!

Sorry for the slow response, and thank you very much for the reply!

I had hoped there would be a more newbie-friendly way to discuss features than the mailing list. That’s why I posted the question here: in my personal and humble opinion, this forum has far fewer barriers and is more visible to users than the mailing list. But of course, I am not in a position to change Fedora processes (I can only give some feedback), so I will post my idea there. Maybe modern mailing list interfaces have become much better and my knowledge is a bit outdated - let’s give it a try :slight_smile:

After the initial discussion, if the idea gets some positive attention, I will create a proposal via the “Changes policy” procedure. By the way, thank you very much for the link - the process looks pretty clear (at least according to the documentation; I have never tried to push something into Fedora before).

Regarding your questions.

Is this all just llvm?

No, GCC supports PGO too (even with the same compiler switches: -fprofile-generate/-fprofile-use). There are some differences between the PGO implementations in LLVM and GCC, but they should not be a problem in this case. So packages built with GCC can use PGO just as well as packages built with LLVM-based compilers.

What’s the cost of doing this? Slower compile times?

I see several “costs” of PGO for Fedora:

  • Slower build times. In a basic scenario, the PGO process consists of 3 crucial steps: compile the application with instrumentation; run the instrumented application on a representative workload (aka the PGO training phase); compile the application once again using the runtime statistics gathered in the previous step. There are ways to avoid building everything multiple times by caching the gathered statistics between builds, but that introduces other problems, like refreshing the cached profiles from time to time and managing storage for them (btw, the regular VCS can be used for this - that is how PGO works in Go). If CI pipelines for some architectures cannot allocate resources for PGO due to a limited amount of hardware (I guess platforms like PowerPC), maintainers can opt out of PGO for such archs.
  • Increased maintainer workload. Maintainers will need to update their package recipes to enable PGO. Whether it is worth it (the “performance gain vs maintenance burden” tradeoff) depends on the package and the maintainer. For some projects (like Clang) it could be easier to enable PGO since it’s already supported in the upstream build scripts (custom CMake scripts); for other packages more scripting could be required.
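
To make the three steps above concrete, here is a minimal sketch in Python that only assembles the command lines of a basic PGO build and does not run anything. The helper `pgo_commands` and all names passed to it or used inside it (`app.c`, `app`, `--bench`, `pgo-raw`, `merged.profdata`) are hypothetical placeholders. The compiler switches are the -fprofile-generate/-fprofile-use pair mentioned above, plus `llvm-profdata merge` for Clang. Passing the directory to `llvm-profdata merge` assumes it picks up the raw profiles inside it; listing the `.profraw` files explicitly works too. A real package would wire the same steps into its spec file and build system.

```python
import shlex


def pgo_commands(compiler, source, binary, workload_args):
    """Return the command lines of a basic three-step PGO build.

    Nothing is executed here; the caller decides how to run the commands.
    All file and directory names are placeholders for illustration.
    """
    # Training run on a (hopefully) representative workload.
    train = ["./" + binary, *workload_args]

    if compiler == "clang":
        raw_dir = "pgo-raw"          # hypothetical directory for raw profiles
        merged = "merged.profdata"   # hypothetical name of the merged profile
        return [
            # Step 1: build with instrumentation.
            ["clang", "-O2", "-fprofile-generate=" + raw_dir, source, "-o", binary],
            # Step 2: training run; raw profiles end up in raw_dir.
            train,
            # Clang needs the raw profiles merged into one indexed profile.
            ["llvm-profdata", "merge", "-output=" + merged, raw_dir],
            # Step 3: rebuild using the merged profile.
            ["clang", "-O2", "-fprofile-use=" + merged, source, "-o", binary],
        ]

    if compiler == "gcc":
        return [
            # Step 1: build with instrumentation.
            ["gcc", "-O2", "-fprofile-generate", source, "-o", binary],
            # Step 2: training run; GCC writes .gcda files on exit.
            train,
            # Step 3: rebuild; GCC looks up the .gcda files itself.
            ["gcc", "-O2", "-fprofile-use", source, "-o", binary],
        ]

    raise ValueError("unsupported compiler: " + compiler)


if __name__ == "__main__":
    for cmd in pgo_commands("clang", "app.c", "app", ["--bench"]):
        print(shlex.join(cmd))
```

Caching profiles between builds, as described in the first bullet, would amount to skipping the first two steps and feeding a stored profile to the last one.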

What ‘profiles’ are used?

That’s a very good question! I see here several options:

  • If the project authors support PGO, they may already provide training workloads that they consider good. In such cases, Fedora maintainers can reuse them.
  • If upstream doesn’t have a good training workload, the “inspiration” for one can be taken elsewhere. From my experience, I can recommend taking a look at the CachyOS package recipes (since they care about PGO a lot) in the CachyOS/CachyOS-PKGBUILDS repository on GitHub, or at the “awesome-pgo” repo, where I try to collect such workloads for different projects.
  • If none of the above is available, the maintainer can decide to implement their own training scenario based on their knowledge of the package. Of course, this will require additional time and effort from the maintainer.
  • If the maintainer doesn’t want to spend their time on it, just skip enabling PGO for the package :slight_smile:

A small comment regarding the second part - LLVM BOLT. LLVM BOLT is an additional optimization tool that performs some optimizations that are not available (at least not yet) with PGO. The main thing it does is code layout optimization to reduce CPU instruction cache misses. From the maintainer’s perspective, it’s similar to PGO: instrumentation, training phase, and optimization. However, LLVM BOLT (and similar tools in this domain like Google Propeller and Intel TLO) currently has more limitations than PGO, so a discussion about it can probably be postponed. From my perspective, PGO integration brings more benefits with less implementation overhead than LLVM BOLT.

For cross-reference, the mailing list post is here: Expand usage of Profile-Guided Optimization (PGO) and LLVM BOLT across Fedora packages (devel, Fedora mailing lists)

I want to highlight another important question that I forgot to mention earlier - reproducibility, since it can become a blocker for enabling PGO in packages.

Since a PGO profile is an additional input to the compiler, it can become a headache for reproducible builds. I think the easiest mitigation here is compiling with pre-generated profiles that are stored in a VCS and reused during builds - in this case, the build remains reproducible.

This question is raised in almost all discussions about PGO (CachyOS devs, maintainers from other OSes), and e.g. here: Mention reproduciblity · Issue #8 · zamazan4ik/awesome-pgo · GitHub

Hello.

I wanted to let you know that we’re actively working on building LLVM (clang, llvm, etc.) with PGO for Fedora. I can totally relate to problems with long build times or simply long feedback cycles due to many recompilations. But in the end I think it will be worth it.

We’re seeing a 22% performance improvement in compile time on rawhide on x86_64.

I learned a lot while doing my PGO experiment with the still-unmerged LLVM packages last year. Some of what I learned is no longer useful, but it’s good to see the performance improvement grow from 9.7% to ~22%. Also, everything is much more streamlined nowadays and we can do the full PGO pipeline in one build instead of requiring multiple Copr projects with different sets of build conditions and complicated profile merging in a background process.

Workload

Currently we use the llvm-test-suite as the workload. You can look around in my branch for the places where the pgo build condition is used. I must emphasize that we’re not done yet and we still need some thorough testing in order to land this.

Build times

Here’s a screenshot of build times on fedora-rawhide-x86_64. It shows (a) the old (unmerged) llvm packages until around March 2024, then (b) a new line with the build times of the merged llvm packages. The PGO build times (c) are rendered on the rightmost side, and the bottom picture shows these PGO build times in more detail. Don’t be fooled by the drop in compile time around November 2023: that is when we started using high-performance x86_64 builders on Copr.

Performance comparison

During each build we run a performance comparison of the system clang against the just-built PGOed clang, and that is where we see the 22% increase in compile performance. Each of our build logs (e.g. this failing one) now contains a section like the following.

Result of Performance comparison between system and PGOed clang
+ echo 'Result of Performance comparison between system and PGOed clang'
+ cat /builddir/build/BUILD/llvm-20.0.0_pre20241029.g757d0e4764fffc-build/performance-of-pgoed-clang/results-system-vs-pgo.txt
Tests: 7
Metric: compile_time

Program                                       compile_time                                              
                                              19.1.0       pgo-20.0.0~pre20241029.g757d0e4764fffc diff  
tramp3d-v4/tramp3d-v4                          18.81        16.41                                 -12.8%
mafft/pairlocalalign                            6.89         5.54                                 -19.5%
sqlite3/sqlite3                                 8.19         6.56                                 -19.9%
consumer-typeset/consumer-typeset               6.56         5.21                                 -20.6%
SPASS/SPASS                                    10.02         7.88                                 -21.4%
Bullet/bullet                                  19.58        14.95                                 -23.7%
kimwitu++/kc                                    9.83         7.36                                 -25.2%
                           Geomean difference                                                     -20.5%
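
For reference, the “Geomean difference” line can be reproduced from the per-program timings in the table above. Here is a small Python sketch, assuming the summary is the geometric mean of the per-program PGO/system compile-time ratios minus one. That is an assumption about how the figure is derived, though it matches the printed value. The names `timings` and `geomean_diff` are made up for this example.

```python
import math

# (system clang 19.1.0, PGOed clang) compile times copied from the table above.
timings = {
    "tramp3d-v4/tramp3d-v4": (18.81, 16.41),
    "mafft/pairlocalalign": (6.89, 5.54),
    "sqlite3/sqlite3": (8.19, 6.56),
    "consumer-typeset/consumer-typeset": (6.56, 5.21),
    "SPASS/SPASS": (10.02, 7.88),
    "Bullet/bullet": (19.58, 14.95),
    "kimwitu++/kc": (9.83, 7.36),
}


def geomean_diff(pairs):
    """Geometric mean of new/old ratios, expressed as a percentage change."""
    ratios = [new / old for old, new in pairs]
    # Averaging the logarithms is the usual way to compute a geometric mean.
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return (math.exp(log_mean) - 1.0) * 100.0


print(f"{geomean_diff(timings.values()):.1f}%")  # prints -20.5%
```

A geometric mean is the natural choice here because the per-program results are ratios: it keeps a single long-running program such as Bullet from dominating the summary.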

When you install a PGOed clang we will also install these two files:

/usr/share/llvm-pgo.profdata
/usr/share/results-system-vs-pgo.txt

Both files are purely informative. The first one is the merged PGO profile (~15 MB). Right now we’re regenerating this data on every build. The second contains the output of the performance comparison above.

The build log file also contains this section which lists the Top 10 functions with the largest internal block counts:

+ /builddir/build/BUILD/llvm-20.0.0_pre20241029.g757d0e4764fffc-build/bootstrapped-llvm/bin/llvm-profdata show --topn=10 /builddir/build/BUILD/llvm-20.0.0_pre20241029.g757d0e4764fffc-build/pgo-O3-profiles/O3.cmake.profdata
+ /builddir/build/BUILD/llvm-20.0.0_pre20241029.g757d0e4764fffc-build/bootstrapped-llvm/bin/llvm-cxxfilt
Instrumentation level: IR  entry_first = 0
Total functions: 39991
Maximum function count: 6259866296
Maximum internal block count: 6255990084
Top 10 functions with the largest internal block counts: 
  llvm::SUnit::addPred(llvm::SDep const&, bool), max count = 6259866296
  llvm::hash_code llvm::hash_combine<unsigned int, unsigned long>(unsigned int const&, unsigned long const&), max count = 3488439024
  llvm::SmallVectorTemplateBase<unsigned int, true>::push_back(unsigned int), max count = 2960419755
  llvm::hash_code llvm::hash_combine<llvm::MachineOperand::MachineOperandType, unsigned int, unsigned int, bool>(llvm::MachineOperand::MachineOperandType const&, unsigned int const&, unsigned int const&, bool const&), max count = 2625039396
  llvm::SmallPtrSetImplBase::insert_imp(void const*), max count = 1747693109
  llvm::APInt::APInt(unsigned int, unsigned long, bool, bool), max count = 1497545994
  llvm::SmallPtrSetImplBase::find_imp(void const*) const, max count = 1226146744
  llvm::APInt::operator=(llvm::APInt&&), max count = 799644853
  llvm::hash_code llvm::hash_combine<llvm::MachineOperand::MachineOperandType, unsigned int, long>(llvm::MachineOperand::MachineOperandType const&, unsigned int const&, long const&), max count = 717543351
  llvm::hash_code llvm::hash_combine<unsigned int, llvm::Type*, llvm::hash_code>(unsigned int const&, llvm::Type* const&, llvm::hash_code const&), max count = 635070555

Once you’ve installed a PGOed clang, you can run the same inspection yourself:

$ llvm-profdata show --topn=10 /usr/share/llvm-pgo.profdata | llvm-cxxfilt

Outlook

Let’s hope we can finish this quickly so everybody using clang in Fedora (>=f41) can benefit from this performance improvement. Of course, there are other work-related things that require my attention.

Cheers
