F42 Change Proposal: Optimized Binaries for the AMD64 / x86_64 Architecture (v2) (self-contained)

Optimized Binaries for the AMD64 / x86_64 Architecture (v2)

This is a proposed Change for Fedora Linux.
This document represents a proposed Change. As part of the Changes process, proposals are publicly announced in order to receive community feedback. This proposal will only be implemented if approved by the Fedora Engineering Steering Committee.


Summary

Individual packages can already provide optimized libraries via the glibc-hwcaps mechanism. This approach will be extended to executables. The package provides optimized variants of a binary in a different directory. The binary in /usr/bin is replaced by a symlink to a small helper program. At runtime, this program finds the most appropriate variant and executes it.

The decision about which packages provide optimized code, and at which level, will be made by individual package maintainers based on benchmark results. A few programs/packages will be updated by the Change Owners to show how the mechanism works.

Owner

Current status

  • Targeted release: Fedora Linux 42
  • Last updated: 2025-01-07
  • [ Announced]
  • [ Discussion thread]
  • FESCo issue:
  • Tracker bug:
  • Release notes tracker:

Detailed Description

This is an updated version of Changes/Optimized_Binaries_for_the_AMD64_Architecture.

Fedora binaries for the AMD64 / x86_64 architecture are compiled with code-generation flags that support almost all CPU variants. But newer generations of processors gained additional instructions that may be used to generate faster code. A vendor-independent x86-64 psABI supplement defines four “microarchitecture levels”: x86-64-v1 (the baseline, our code targets this), x86-64-v2 (adds SSE3 through SSE4.2; CentOS targets this), x86-64-v3 (adds AVX and AVX2), x86-64-v4 (adds AVX-512) [1]. Code compiled for a higher microarchitecture level will crash (with SIGILL, “illegal instruction”) on CPUs which do not support that level. Benchmark results show small differences in performance: usually in the range from -5% to +10%, with no discernible difference for most code, but some applications benefit, with gains of 120% in some benchmarks [e.g. 2, 4].

Over the years, various people have expressed interest in raising the required microarchitecture levels. But we have been very conservative in making changes, because support is missing in many older CPUs that are still in use, and in fact, even in some CPUs produced and sold today. By raising the required level we would make Fedora completely unusable on many machines. It also seems that recompiling all packages with the changed options would largely be a waste of resources, because for most code it makes no difference. But for some of the numerical or cryptographic code there are noticeable gains and it seems to be worth the effort to provide optimized code. This also makes Fedora more attractive to people interested in optimization.

The dynamic linker already has the glibc-hwcaps mechanism to load optimized implementations of shared objects [3]. This means that packages can provide optimized libraries and the linker will automatically load them from separate directories if appropriate. (For AMD64, this is /usr/lib64/glibc-hwcaps/x86-64-v{2,3,4}/.)

This Change is about extending the glibc-hwcaps mechanism to executables. A small helper binary is provided. A program in /usr/bin (or another path) is symlinked to this helper. When executed, the helper checks the capabilities of the CPU and searches for the most appropriate variant of the target program in a separate directory hierarchy. It then launches one of the optimized binaries or the “generic” one compiled for the baseline.

This means that individual packages “opt in”, by moving their binary to the alternative directory hierarchy and replacing it by a symlink, and also providing one or more optimized variants.
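
The selection step described above can be sketched roughly as follows. This is only an illustration, not the actual helper: the directory layout (`<root>/<level>/<name>` with a `generic` fallback) and the function names are assumptions made for this example, since the proposal does not specify them.

```python
import os

# Microarchitecture levels, from most to least demanding.
LEVELS = ["x86-64-v4", "x86-64-v3", "x86-64-v2"]


def pick_variant(name, supported, root, exists=os.path.exists):
    """Return the path of the best available variant of `name`.

    Hypothetical layout (not taken from the proposal):
      <root>/<level>/<name>   optimized variants
      <root>/generic/<name>   baseline build
    """
    for level in LEVELS:
        candidate = os.path.join(root, level, name)
        # A variant is usable only if the CPU supports its level
        # and the package actually ships it.
        if level in supported and exists(candidate):
            return candidate
    return os.path.join(root, "generic", name)


def run(argv, supported, root):
    """Replace the current process with the selected variant."""
    # The name of the symlink that was invoked identifies the program.
    name = os.path.basename(argv[0])
    os.execv(pick_variant(name, supported, root), argv)
```

Because the helper replaces itself with the chosen binary through `exec`, the selection cost is paid once per program start and the optimized program then runs unmodified.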

Note: the ELF format provides the IFUNC mechanism to dynamically select a variant of a function (symbol) when an executable is loaded [5]. This is used in particular to load code that uses specific CPU instructions when those are supported. This mechanism is more general (because it allows arbitrary selection criteria), more fine-grained (because there can be other variants than just a few fixed microarchitecture levels), and more efficient (because only the parts of the code that benefit from this need to be provided in multiple variants). In particular, glibc already makes extensive use of this to provide optimized code, which is then widely used by other libraries and programs. This means that even though we compile code in a way where the lowest baseline is supported, modern CPU instructions are already widely used. This is one of the reasons why compiling for a higher baseline often doesn’t make any difference in benchmarks. The IFUNC mechanism or an equivalent mechanism should generally be preferred. Nevertheless, it needs to be implemented in the program or library itself, which is not trivial. The mechanism in this Proposal is intended for code which does not use IFUNCs or a similar mechanism.

[1] https://hackweek.opensuse.org/all/projects/support-glibc-hwcaps-and-micro-architecture-package-generation
[2] rfcs/0002-march.rst · master · Arch Linux / RFCs · GitLab
[3] The GNU C Library version 2.33 is now available
[4] CentOS ISA SIG Performance Investigation – Blog.CentOS.org
[5] GNU Indirect Function and x86 ELF ABIs | jasoncc.github.io

Glibc-hwcaps together with the new helper provide a generic mechanism. It will be up to individual packages to actually provide code which makes use of it. Individual package maintainers are encouraged to benchmark their packages after recompilation, and provide the optimized variants if useful. (I.e. the code in question is measurably faster and the program is run often enough for this to make a difference.)

The Change Owners will implement the packaging changes for a few packages while developing the general mechanism and will submit those as pull requests. Other maintainers are asked to do the same for their packages if desired.

Optimized variants of programs and libraries MAY be packaged in a separate subpackage. The general packaging rules should be applied, i.e. a separate package or packages SHOULD be created if the files are large enough.

Available benchmark results [2,4] are narrow and not very convincing. We should plan an evaluation of results after one release. If it turns out that the real gains are too small, we can scrap the effort. On the other hand, we should also consider other architectures. For example, microarchitecture levels z{14,15} for s390x or power{9,10} for ppc64le. Other architectures are not included in this Change Proposal to reduce its scope.

Feedback

Benefit to Fedora

The developers who are interested in this kind of optimization work can perform it within Fedora, without having to build separate repositories. The users who have the appropriate hardware will gain performance benefits. Faster code is also more energy-efficient. The change will be automatic and transparent to users.

Note that other distributions use higher microarchitecture levels. For example RHEL 9 uses x86-64-v2 as the baseline, RHEL 10 uses x86-64-v3, and other distros provide optimized variants (OpenSUSE, Arch Linux, Ubuntu). We implement the same change in Fedora in a way that is scoped more narrowly, and thus vastly cheaper in the sense of development effort, code compilation time, storage and distribution overhead, but should provide the same performance and energy benefits.

Scope

  • Proposal owners:

    • Package hwcaps-loader.
    • Find some example packages to convert (the code must do “number crunching” or string processing, and must not already use IFUNCs or glibc-hwcaps or some other mechanism).
    • Convert a few packages and submit the changes as pull requests.
    • Submit a draft change to the Packaging Guidelines.
    • Do benchmarks.
  • Other developers:

    • Consider converting some additional packages.
    • Review and merge the Packaging Guidelines change.
  • Release engineering: #Releng issue number

  • Policies and guidelines: N/A (not needed for this Change)

  • Trademark approval: N/A (not needed for this Change)

  • Alignment with the Fedora Strategy:

Upgrade/compatibility impact

Early Testing (Optional)

Do you require ‘QA Blueprint’ support? N

How To Test

  • Install one of the converted packages
  • Run the program. If the hardware supports the optimized variant, verify that it was run. If the hardware does not support any of the optimized variants, verify that the baseline version was executed.

User Experience

The change should be invisible to users, except that some programs may execute more quickly.

Dependencies

Contingency Plan

  • Contingency mechanism: Revert changes in individual packages. This can be done either by the maintainers of those packages or by the Change Owners using provenpackager privileges.
  • Contingency deadline: any time really. The changes are independent between packages, so we can trivially convert and unconvert individual programs even after release.
  • Blocks release? No

Documentation

N/A (not a System Wide Change)

Release Notes

Last edited by @amoloney 2025-01-10T19:38:55Z


How do you feel about the proposal as written?

  • Strongly in favor
  • In favor, with reservations
  • Neutral
  • Opposed, but could be convinced
  • Strongly opposed
0 voters

If you are in favor but have reservations, or are opposed but something could change your mind, please explain in a reply.

We want everyone to be heard, but many posts repeating the same thing actually makes that harder. If you have something new to say, please say it. If, instead, you find someone has already covered what you’d like to express, please simply give that post a :heart: instead of reiterating. You can even do this by email, by replying with the heart emoji or just “+1”. This will make long topics easier to follow.

Please note that this is an advisory “straw poll” meant to gauge sentiment. It isn’t a vote or a scientific survey. See About the Change Proposals category for more about the Change Process and moderation policy.

I nominate the stockfish package, which I co-maintain, as an example package to demonstrate the best-case benefits. As a strong chess engine, it is very computationally intensive. It contains a command-line tool that is easy to benchmark. It does not ship a shared library or support runtime CPU dispatch (so it cannot benefit from currently-available approaches), and it is known to benefit significantly from higher-level ISA extensions.

(Mentioning @omos, the primary stockfish maintainer, so that they will know what I wrote here.)


How do I find out which microarchitecture level my machine supports? I am asking because I want to check which levels our builders support.

The classification of levels is hazy and/or incomplete. There are x86_64 CPUs which have AVX (256-bit ymm registers, floating-point operations only) but lack AVX2 (256-bit integer operations), and this possibility is not recognized by the levels. Where does such a CPU belong?

For instance, I have two such CPUs. They are old (12 years), but they still do useful work.
Both have sse, sse2, ssse3, sse4_1, sse4_2, sse4a, popcnt, aes, pclmulqdq, and bmi1; so they satisfy floating-point arithmetic and [scalar] crypto requirements.

===== /proc/cpuinfo for one machine

processor : 0
vendor_id : AuthenticAMD
cpu family : 21
model : 16
model name : AMD A10-5800K APU with Radeon™ HD Graphics
stepping : 1
microcode : 0x6001119
cpu MHz : 1400.000
cache size : 2048 KB
physical id : 1
siblings : 4
core id : 0
cpu cores : 4
apicid : 16
initial apicid : 16
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp
lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni
pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c
lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch
osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core
perfctr_nb cpb hw_pstate ssbd vmmcall bmi1 arat npt lbrv svm_lock nrip_save
tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
bugs : fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed
bogomips : 7600.72
TLB size : 1536 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro

===== /proc/cpuinfo for the other machine

processor : 0
vendor_id : AuthenticAMD
cpu family : 21
model : 56
model name : AMD A10-7890K Radeon R7, 12 Compute Cores 4C+8G
stepping : 1
microcode : 0x6003106
cpu MHz : 1696.394
cache size : 2048 KB
physical id : 1
siblings : 4
core id : 0
cpu cores : 4
apicid : 16
initial apicid : 16
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp
lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni
pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c
lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch
osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core
perfctr_nb bpext ptsc cpb hw_pstate ssbd vmmcall fsgsbase bmi1 xsaveopt arat
npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists
pausefilter pfthreshold overflow_recov
bugs : fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed
bogomips : 8185.07
TLB size : 1536 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro
=====

I am not exactly sure if this is what you are asking, but for a quick one-off check, ld.so --help includes the list of glibc-hwcaps subdirectories for your system; a level is supported only if its entry is marked “supported”. One can also parse the various flags from /proc/cpuinfo (and there are scripts floating around that do so).
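
Parsing the flags can look roughly like the sketch below. The flag sets follow the feature lists in the x86-64 psABI, translated to the names Linux prints in /proc/cpuinfo (SSE3 shows up as `pni`, LZCNT as `abm`, and OSXSAVE is approximated by `xsave`). The function names are made up for this example, and glibc itself queries CPUID directly, so treat the result as an approximation of what ld.so reports.

```python
# Flags added by each level, using /proc/cpuinfo naming on Linux.
LEVEL_FLAGS = [
    (2, {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}),
    (3, {"abm", "avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "movbe", "xsave"}),
    (4, {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}),
]


def level_from_flags(flags):
    """Return the highest level (1-4) whose required flags are all present."""
    level = 1  # assume the x86-64 baseline is satisfied
    for candidate, required in LEVEL_FLAGS:
        # Levels are cumulative, so stop at the first one that is not met.
        if not required <= flags:
            break
        level = candidate
    return level


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the flag set of the first CPU listed in `path`."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()
```

On a Linux x86_64 machine, `level_from_flags(read_cpu_flags())` returns the level as an integer.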

Do you know that the code will benefit from AVX instructions?
Simply being compute bound is not sufficient to indicate the code will be sped up.

Thank you, this is what I asked for.

I checked our COPR builders, and some of them do NOT support x86-64-v4; x86-64-v3 is the highest level they support. We can route builds that require (or can utilize) a higher level to newer builders, but this must somehow be indicated at the SPEC or RPM level. IIRC, there is no such thing right now, so that would need an implementation in RPM.


In theory you can build for a higher architecture level on older builders (with -march), but to test the resulting build (commonly in the %check section?) you need to run it on hardware that supports the level it was built for.

The question may be whether it is easier (and cheaper) to add logic that dispatches builds to suitable hosts, or to replace the builders so that all of them support -v4 (or to limit the initial choices to -v2 and -v3, which all the builders apparently support).

These days Stockfish uses neural networks to do most of the “thinking”, so I would expect it to benefit a lot from AVX support (especially AVX-512).


Yes, @barryascott is right in general that not all compute-bound software benefits from ISA extensions, but Stockfish does have many inner loops that are explicitly written with intrinsics, and upstream ships multiple flavors of x86_64 binary releases: base (x86_64-v1), sse41 (x86_64-v2), avx2, bmi2 (x86_64-v3), avx512 (x86_64-v4), vnni256, and vnni512.

The full list of “supported architectures” that are optimization targets for upstream is:

                 x86-64-vnni512 x86-64-vnni256 x86-64-avx512 x86-64-avxvnni x86-64-bmi2 \
                 x86-64-avx2 x86-64-sse41-popcnt x86-64-modern x86-64-ssse3 x86-64-sse3-popcnt \
                 x86-64 x86-32-sse41-popcnt x86-32-sse2 x86-32 ppc-64 ppc-64-altivec ppc-64-vsx ppc-32 e2k \
                 armv7 armv7-neon armv8 armv8-dotprod apple-silicon general-64 general-32 riscv64 \
                 loongarch64 loongarch64-lsx loongarch64-lasx))

I just did quick builds from source in a git checkout from upstream, and got the following on my workstation:

$ make -j profile-build ARCH=x86-64
$ ./stockfish bench
[…]
Total time (ms) : 1332
Nodes searched  : 999324
Nodes/second    : 750243
$ make -j profile-build ARCH=x86-64-sse41-popcnt
$ ./stockfish bench
[…]
Total time (ms) : 764
Nodes searched  : 999324
Nodes/second    : 1308015
$ make -j profile-build ARCH=x86-64-bmi2
$ ./stockfish bench
[…]
Total time (ms) : 655
Nodes searched  : 999324
Nodes/second    : 1525685

So stockfish runs about 1.7 times as fast at x86_64-v2, and about 17% faster than that at x86_64-v3. I don’t have x86_64-v4 hardware.
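
For reference, the ratios can be recomputed from the Nodes/second figures of the three runs above. The mapping of ARCH values to microarchitecture levels is the approximate one used in this thread (sse41-popcnt for x86_64-v2, bmi2 for x86_64-v3).

```python
# Nodes/second from the three `stockfish bench` runs quoted above.
nps = {
    "x86-64": 750243,               # baseline
    "x86-64-sse41-popcnt": 1308015,  # roughly x86_64-v2
    "x86-64-bmi2": 1525685,          # roughly x86_64-v3
}


def speedup(new, old):
    """Throughput of build `new` relative to build `old`."""
    return nps[new] / nps[old]


v2_vs_v1 = speedup("x86-64-sse41-popcnt", "x86-64")
v3_vs_v2 = speedup("x86-64-bmi2", "x86-64-sse41-popcnt")

print(f"v2 vs. baseline: {v2_vs_v1:.2f}x")  # prints "v2 vs. baseline: 1.74x"
print(f"v3 vs. v2:       {v3_vs_v2:.2f}x")  # prints "v3 vs. v2:       1.17x"
```

These are single runs on one machine, so the numbers only indicate the rough size of the effect.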


In theory you can build for a higher architecture level on older builders (with -march), but to test the resulting build (commonly in the %check section?) you need to run it on hardware that supports the level it was built for.

I agree with this, but I would like to point out one more wrinkle: you can’t do profile-guided optimization (PGO) when building for an architecture level higher than the builder supports, because the instrumented binary has to be run during the build. There aren’t a lot of Fedora packages that are using PGO, and not a lot of upstream projects that provide explicit support for it, but I suspect there is a lot of overlap between the packages that support and benefit from PGO and those that benefit from ISA extensions.

(We do build stockfish with PGO, and if we couldn’t, that might cancel out the benefit of some of the higher microarchitecture levels. It would take some benchmarking to be sure. At least x86_64-v2 would still be worth it.)

The classification of levels is hazy and/or incomplete.

The levels aren’t meant to perfectly target all combinations of features in all CPUs. They’re meant to tame the combinatorial explosion of possible features. They were chosen, as I understand it, by considering how useful specific ISA extensions are as well as how widely they were shipped. A particular machine can run code built for any level up to the highest one whose features it fully satisfies.

There are x86_64 CPUs which have AVX (256-bit ymm registers, floating-point operations only) but lack AVX2 (256-bit integer operations), and this possibility is not recognized by the levels. Where does such a CPU belong?

These appear to be x86_64-v2. Owners of such CPUs may be understandably disappointed that there isn’t a level that specifies AVX but not AVX2, but there is still a significant benefit to going from x86_64 to x86_64-v2 on these CPUs, since this allows the use of instructions up to SSE4.2, plus some instructions like POPCNT and CMPXCHG16B that are very effective in certain programs.

Both have sse, sse2, ssse3, sse4_1, sse4_2, sse4a, popcnt, aes, pclmulqdq, and bmi1; so they satisfy floating-point arithmetic and [scalar] crypto requirements.

Of these, sse and sse2 are already in the baseline; ssse3, sse4_1, sse4_2, and popcnt are part of x86_64-v2; sse4a is AMD-specific and is not required by any level; bmi1 is lumped in with x86_64-v3, which those CPUs don’t satisfy. None of the levels requires aes, so programs could still only use that by runtime detection and dispatch in any case. I think this may also be true of the CLMUL instructions.

So compared to compiling for exactly your CPU, in the manner of -march=native, you wouldn’t see the maximum possible benefit on these machines, but you would still see a benefit – and targeting a limited number of feature sets is what makes this proposal potentially feasible.

I do have x86_64-v4 hardware, so I ran a very quick, unscientific test on my (clearly differently performing) platform.

make -j profile-build ARCH=x86-64
....
Total time (ms) : 2796
Nodes searched  : 999324
Nodes/second    : 357412


make -j profile-build ARCH=x86-64-sse41-popcnt
....
Total time (ms) : 1652
Nodes searched  : 999324
Nodes/second    : 604917


make -j profile-build ARCH=x86-64-bmi2
....
Total time (ms) : 1400
Nodes searched  : 999324
Nodes/second    : 713802


make -j profile-build ARCH=x86-64-avx512
....
Total time (ms) : 1358
Nodes searched  : 999324
Nodes/second    : 735879

Looks like x86_64-v4 only provides a small improvement, about 3% over x86_64-v3 in this run (although every few percent can help).


Absolutely. This is the bare minimum.

My already-not-supported-anymore Laptop with an 11th Gen Intel CPU is v4.

My paperweight Thinkpad T430 is v3, and that is from 2012. That is 13 years ago.

This might be out-of-scope for this Change Proposal (which is about making this possible, IIUC), but how would package maintainers actually use this?

For example, I have no idea how I would need to modify a package to build different variants (neither for taking advantage of glibc-hwcaps nor of this mechanism). Do I need to monkey-patch %build_cflags / %build_cxxflags macros? Or how would this work for other compilers such as go or rustc?

I think it would be great if there were at least one example that shows how to take advantage of this feature in an actual package, and / or provide RPM macro knobs to do the “dirty work”.
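
Purely as an illustration of what such a knob would have to do for a C program, here is a sketch that generates one compiler invocation per variant. The output layout and the function name are invented for this example. The `-march=x86-64-v2` and `-march=x86-64-v3` values are real compiler options (GCC 11 and later, Clang 12 and later). Go and Rust have their own switches for the same levels (`GOAMD64` and `-C target-cpu`, respectively), which this sketch does not cover.

```python
# Extra flags per variant; "generic" keeps the distribution defaults.
VARIANTS = {
    "generic": [],
    "x86-64-v2": ["-march=x86-64-v2"],
    "x86-64-v3": ["-march=x86-64-v3"],
}


def build_commands(source, name, base_flags, outdir="build"):
    """Return one gcc argv list per variant.

    The <outdir>/<variant>/<name> layout is hypothetical; a real package
    would install into whatever hierarchy the helper ends up searching.
    """
    commands = []
    for variant, extra in VARIANTS.items():
        target = f"{outdir}/{variant}/{name}"
        commands.append(["gcc", *base_flags, *extra, source, "-o", target])
    return commands
```

In a spec file the same idea would mean running the build once per variant with the extra flag appended to the distribution flags, which is where a shared macro could help.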


This might be out-of-scope for this Change Proposal (which is about making this possible, IIUC), but how would package maintainers actually use this?

That part’s TBD :wink: I expect that the mechanism will vary between packages.
So part of the work will be to figure out if there’s some common scheme
that can be applied to many packages.

Zbyszek


Yeah, the oldest Fedora builders (which will be going away later this year) do not have v4 (only up to v3).


Wow! :face_with_monocle: My CPU is a 13700H and is “only” v3.
That said, my home lab will only support v2 for the workers and v1 for the control plane (k8s).

If the current CPU lacks some requirement(s) to support a certain level, then ld.so --help should make a positive statement which lists what is lacking. Currently the output is just

Subdirectories of glibc-hwcaps directories, in priority order:
  x86-64-v4

with no indication of why -v4 is not supported. It should say something like

x86-64-v4   (not supported; this CPU lacks avx512)
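
A message of that kind can be derived from the CPU flags alone. The sketch below is not how ld.so works internally (it queries CPUID). It uses /proc/cpuinfo flag names, where SSE3 is `pni`, LZCNT is `abm`, and OSXSAVE is approximated by `xsave`, and the function name is made up for this example.

```python
# Flags each level adds on top of the previous one (/proc/cpuinfo names).
ADDED = {
    "x86-64-v2": {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"},
    "x86-64-v3": {"abm", "avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "movbe", "xsave"},
    "x86-64-v4": {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"},
}


def explain(level, flags):
    """Say whether `flags` satisfy `level`, naming whatever is missing."""
    if level not in ADDED:
        raise ValueError(f"unknown level: {level}")
    required = set()
    for name, added in ADDED.items():  # dicts preserve insertion order
        required |= added              # levels are cumulative
        if name == level:
            break
    missing = sorted(required - flags)
    if missing:
        return f"{level}   (not supported; this CPU lacks {', '.join(missing)})"
    return f"{level}   (supported)"
```

For a CPU that satisfies x86-64-v3 but has no AVX-512, `explain("x86-64-v4", flags)` lists the five missing avx512 flags.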