F40 Change Proposal: Optimized Binaries for the AMD64 Architecture (System-Wide)

Optimized Binaries for the AMD64 Architecture

This is a proposed Change for Fedora Linux.
This document represents a proposed Change. As part of the Changes process, proposals are publicly announced in order to receive community feedback. This proposal will only be implemented if approved by the Fedora Engineering Steering Committee.

:link: Summary

Additional paths will be inserted into the search path used for executables on systems which have a compatible CPU. Those additional paths will mirror the AMD64 “microarchitecture levels” supported by the glibc-hwcaps mechanism: x86-64-v2, x86-64-v3, x86-64-v4. Systemd will be modified to insert the additional directories into the $PATH environment variable (affecting all programs on the system) and the equivalent internal mechanism in systemd (affecting what executables are used by services). Individual packages can provide optimized libraries via the glibc-hwcaps mechanism and optimized executables via the extended search path. This optimized code will be used if the CPU supports it. The decision about which packages provide the optimized code, and at which level, will be made by individual package maintainers based on benchmark results.

:link: Owner

NOTE: I’m writing and filing this proposal on the last day allowed for system-wide proposals. It is too large for one person. If you are interested, please let me know or even add yourself to the list of Owners. I would love to have more people working on this.

:link: Detailed Description

Fedora binaries for the AMD64 architecture are compiled with code-generation flags that support almost all CPU variants. But newer generations of processors gained additional instructions that may be used to generate faster code. A vendor-independent x86-64 psABI supplement defines four “microarchitecture levels”: x86-64-v1 (the baseline, our code targets this), x86-64-v2 (+SSE3, CentOS targets this), x86-64-v3 (+AVX), x86-64-v4 (+AVX512) [1]. When code is compiled for a higher microarchitecture level it will crash (with SIGILL, “illegal instruction”) on CPUs which do not support it. Benchmark results show small differences in performance: usually in the range from -5% to 10%, with no discernible difference for most code, but some applications benefit, with gains of 120% in some benchmarks [e.g. 2, 4].

Over the years, various people have expressed interest in raising the required microarchitecture levels. But we have been very conservative in making changes, because support is missing in many older CPUs that are still in use, and in fact, even in some CPUs produced and sold today. By raising the required level we would make Fedora completely unusable on many machines. It also seems that recompiling all packages with the changed options would largely be a waste of resources, because for most code it makes no difference. But for some of the numerical or cryptographic code there are noticeable gains and it seems to be worth the effort to provide optimized code. This also makes Fedora more attractive to people interested in optimization.

The dynamic linker already has the glibc-hwcaps mechanism to load optimized implementations of shared objects [3]. This means that packages can provide optimized libraries and the linker will automatically load them from separate directories if appropriate. (For AMD64, this is /usr/lib64/glibc-hwcaps/x86-64-v{2,3,4}/.)
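The lookup precedence can be illustrated with a small sketch. This is not glibc's code; it only models, for a single library directory, the idea that the most specific supported level wins and the baseline copy is the fallback. The function name `resolve_library` and any library names passed to it are hypothetical.

```python
import os

# Levels in priority order: the most specific level is tried first.
LEVELS = ["x86-64-v4", "x86-64-v3", "x86-64-v2"]


def resolve_library(libdir, soname, supported_levels):
    """Return the path this simplified lookup would pick, or None.

    libdir            -- a library directory such as /usr/lib64
    soname            -- file name of the shared object
    supported_levels  -- set of level names the CPU supports
    """
    for level in LEVELS:
        if level not in supported_levels:
            continue  # never load code the CPU cannot execute
        candidate = os.path.join(libdir, "glibc-hwcaps", level, soname)
        if os.path.exists(candidate):
            return candidate
    # No optimized variant is installed (or usable): use the baseline build.
    baseline = os.path.join(libdir, soname)
    return baseline if os.path.exists(baseline) else None
```

With a baseline copy and an x86-64-v3 copy installed, a CPU that supports v2 and v3 would get the v3 copy, while a CPU that supports only v2 would get the baseline.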

To extend the glibc-hwcaps mechanism to executables, systemd will be modified to extend the search path with appropriate directories. When started, it will check the CPU capabilities and modify its internal executable search path, which is also used to set $PATH for services. (For AMD64, /usr/bin/glibc-hwcaps/x86-64-v{2,3,4}/.)
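A rough sketch of the selection logic follows, written in Python for readability rather than in the C that systemd would use. The flag sets follow the psABI level requirements, spelled the way Linux reports them in /proc/cpuinfo (with `xsave` standing in for OSXSAVE, which is an approximation). The function names, and the choice to place the most specific directory first and ahead of the existing entries, are assumptions and not part of the proposal.

```python
# Flags required by each level, as spelled in /proc/cpuinfo on Linux.
# Levels are cumulative: v3 implies v2, v4 implies v3.
LEVEL_FLAGS = [
    ("x86-64-v2", {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}),
    ("x86-64-v3", {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"}),
    ("x86-64-v4", {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}),
]


def read_cpu_flags(cpuinfo="/proc/cpuinfo"):
    """Return the flag set of the first CPU listed in a cpuinfo-style file."""
    with open(cpuinfo) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


def supported_levels(cpu_flags):
    """Return the supported levels, lowest first, stopping at the first gap."""
    levels = []
    for name, required in LEVEL_FLAGS:
        if not required <= cpu_flags:
            break
        levels.append(name)
    return levels


def extended_path(base_path, cpu_flags, bindir="/usr/bin"):
    """Prepend the hwcaps directories, most specific first (assumed order)."""
    extra = [f"{bindir}/glibc-hwcaps/{lvl}" for lvl in reversed(supported_levels(cpu_flags))]
    return ":".join(extra + [p for p in base_path.split(":") if p])
```

On a Linux machine, `extended_path("/usr/local/bin:/usr/bin", read_cpu_flags())` shows what the resulting search path would look like under these assumptions.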

Note: the ELF format provides the IFUNC mechanism to dynamically select a variant of a function (symbol) when an executable is loaded [5]. This is in particular used to load code using specific CPU instructions when those are supported. This mechanism is more general (because it allows arbitrary selection criteria), more fine-grained (because there can be other variants than just a few fixed microarchitecture levels), and more efficient (because only the parts of the code that benefit from this need to be provided in multiple variants). In particular, glibc already makes extensive use of this to provide optimized code, which is then widely used by other libraries and programs. This means that even though we compile code in a way where the lowest baseline is supported, modern CPU instructions are already widely used. This is one of the reasons why compiling for a higher baseline often doesn’t make any difference in benchmarks. The IFUNC mechanism or an equivalent mechanism should generally be preferred. Nevertheless, that needs to be implemented in the program or library itself, which is not trivial. The two mechanisms in this Proposal are intended for the packages which do not support IFUNCs or some other equivalent mechanism.

[1] SUSE Hack Week: Support glibc-hwcaps and micro-architecture package generation
[2] rfcs/0002-march.rst · master · Arch Linux / RFCs · GitLab
[3] The GNU C Library version 2.33 is now available
[4] CentOS ISA SIG Performance Investigation – Blog.CentOS.org
[5] GNU Indirect Function and x86 ELF ABIs | jasoncc.github.io

Glibc-hwcaps together with the new feature in systemd provide a generic mechanism. It will be up to individual packages to actually provide code which makes use of it. Individual package maintainers are encouraged to benchmark their packages after recompilation, and provide the optimized variants if useful. (I.e. the code in question is measurably faster and the program is run often enough for this to make a difference.)

The Change Owners will implement the packaging changes for a few packages while developing the general mechanism and will submit those as pull requests. Other maintainers are asked to do the same for their packages.

Optimized variants of programs and libraries MAY be packaged in a separate subpackage. The general packaging rules should be applied, i.e. a separate package or packages SHOULD be created if the files are large enough.

Available benchmark results [2,4] are narrow and not very convincing. We should plan an evaluation of results after one release. If it turns out that the real gains are too small, we can scrap the effort. On the other hand, we should also consider other architectures. For example, microarchitecture levels z{14,15} for s390x or power{9,10} for ppc64le. Other architectures are not included in this Change Proposal to reduce its scope.

:link: Feedback

:link: Benefit to Fedora

The developers who are interested in this kind of optimization work can perform it within Fedora, without having to build separate repositories. The users who have the appropriate hardware will gain performance benefits. Faster code is also more energy-efficient. The change will be automatic and transparent to users.

Note that other distributions use higher microarchitecture levels. For example RHEL 9 uses x86-64-v2 as the baseline, RHEL 10 will use x86-64-v3, and other distros provide optimized variants (OpenSUSE, Arch Linux). We implement the same change in Fedora in a way that is scoped more narrowly, but should provide the same performance and energy benefits.

:link: Scope

  • Proposal owners:

    • Extend systemd to set the executable search path using the same criteria as the dynamic linker.
    • Implement packaging changes for at least one package with a library and at least one package with executables and submit these as pull requests.
    • Provide a pull request for the Packaging Guidelines to describe the changes listed in Description above.
  • Other developers:

    • Do benchmarking and implement packaging changes for other packages if beneficial.
  • Release engineering: #11864

  • Policies and guidelines: TBD.

  • Trademark approval: N/A (not needed for this Change)

  • Alignment with Community Initiatives:

:link: Upgrade/compatibility impact

No impact.

:link: How To Test

  • Use /usr/bin/ld.so --help to check which hwcaps are supported by the system.
  • Install one or more packages which provide optimized code.
  • Restart the system or re-login to reinitialize $PATH.
  • Check that appropriate directories are present in $PATH.
  • Run some benchmarks and check that the optimized code is indeed faster.
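The $PATH check in the steps above could be scripted along these lines. This is only a sketch: the helper names are hypothetical, and it assumes the added directories contain a `glibc-hwcaps` path component, as in the layout described in this proposal.

```python
import os


def hwcaps_entries(path_value):
    """Return the entries of a $PATH-style string that lie under glibc-hwcaps."""
    return [p for p in path_value.split(":") if "glibc-hwcaps" in p.split("/")]


def check_path(path_value=None):
    """Print the hwcaps entries found; return True if there is at least one."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    entries = hwcaps_entries(path_value)
    for entry in entries:
        print(entry)
    return bool(entries)
```

Calling `check_path()` in a fresh login session inspects the current environment; passing a string makes it easy to try against a value copied from a service's environment.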

:link: User Experience

There should be no impact for users. If the optimized code is available and installed for their hardware, various tasks may finish faster and use less energy.

:link: Dependencies

:link: Contingency Plan

  • Contingency mechanism: Undo the changes in packages which introduced them and recompile.
  • Contingency deadline: Any time.
  • Blocks release? No.

:link: Documentation

:link: Release Notes

Packages which benefit from being compiled for higher AMD64 microarchitecture levels (x86-64-v2, x86-64-v3, x86-64-v4) are now provided with optimized variants which will be used automatically on appropriate CPUs. This includes: TBD1, TBD2, TBD3.

Can we please use x86_64 as the name of the architecture consistently in this document? That’s the name of the architecture from the Fedora perspective.

Hmm, it’s complicated. The “System V Application Binary Interface AMD64 Architecture Processor Supplement”, i.e. the document that defines the meaning of x86-64-v{1,2,3,4}, consistently uses “AMD64”. This is the primary reason why I made the same choice.

I also think that “x86_64” is possibly the worst name amongst the many confusing names in computing. It’s hard to pronounce and hard to remember. And it was created essentially to erase history: AMD introduced the 64-bit architecture compatible with Intel 386 while Intel was pushing Itanium. I have no problem with talking about x86 for the myriad implementations compatible with i386, but by the same rule, we should call this architecture “AMD64” after the first implementation.

And it’s been published already, so I think it’s too late to retitle. I’ll add a note in the wiki version that amd64 == x86_64 == x86-64.

Do I understand correctly: will the lower limit be raised? Or is the x86-64-v1 baseline still supported, with additional support for higher instruction sets?

The latter. This proposal doesn’t change the default compilation flags at all.

Is there a specific reason to use this directory structure? I’m hoping for something that will eventually fix multilib, eg

/usr/x86-64-v{2,3,4}/{bin,lib}

Happy to help with this proposal by the way

I am open to discussion about changing the layout… Glibc-hwcaps is implemented by suffixing each of the library paths with “glibc-hwcaps/<µarch>”, and this scheme has been out for a while and is unlikely to be changed now. For executables, I followed the same pattern. It certainly would be possible to implement a different scheme. The implementation effort and use would be essentially the same. (As people say, naming is the hardest problem in computer science :wink: . )

I would be very happy to have you on the owners list.

Ah, if it’s an existing scheme then it’s probably easier to stick with it now, and if we revamp multilib layout we can always transition to the new scheme later.

I’ll edit the wiki and add myself!

Hi,

Have you seen how Clear Linux handles this?

glibc-hwcaps has been used by Clear Linux for more than 5 years.

It is also used in Fedora already, but is only one part of a solution to the problem.

Great proposal.

But rather than PATH manipulation how about doing something like Plan 9 did for different architectures. Plan 9 had the files on a file server and mounted appropriately on the compute nodes. But the idea can be implemented on the local filesystem too. Kind of like with ostree managed hard links and mount /usr from the location appropriate to the architecture. Maybe even manage packages so only the architecture appropriate files land on the filesystem though have the ability to include all architectures as well to support things like diskless clients or VMs that use different architectures.

This way all the executables, libraries, etc. are architecture-specific while being presented at a single path no matter which architecture they run on.

Would it be possible to build a package as -v2 only (without the baseline build at all)?

I have one that targets just Intel Broadwell and newer, and is no-op on anything older anyway (and if Intel didn’t cut AVX out from lower-tier and newer chips, I would say just -v3, but I’ll stop my rant right away…).

If I understood correctly the newer RPM changes related to x86_64 feature levels, it should be possible to build <name and version of packages>.x86_64-v2 instead of <name and version of packages>.x86_64

Not a fan of this. If people start pinning deps on v2 or v3, won’t this create a situation where RPMs are just broken for many people on older/slower CPUs? One of the things I like about fedora is it just works.

I feel like bug reports might start filtering in where the response is ‘works on my machine’. I feel like there should be a better way. Maybe the installer could detect the CPU, then alter the arch and install v1, v2, or whatever, or we could just have different spins for v1, v2, etc., like back when we had to pick x86 or amd64.

Maybe an easy way for people to disable/exclude the v2+ binaries should be made available “just in case”. I’m thinking it would be something similar to the nomodeset parameter that people have learned to add on the kernel command line to fall back to the universal VESA video driver when their system wouldn’t come up due to problems with some more-advanced video driver. All the kernel parameter would need to do is to change the default PATH environment variable. In fact, it looks like systemd has this option already: systemd.setenv= (documented in the systemd man page).

Given systemd will be modifying the path to prefer the optimized executables, it should be possible to turn that off (or simply override the paths). It would make sense to document whichever solution is implemented and how to disable it, I agree.

We support x86-64-v1, and packages are supposed to be compiled with the default flags, and those default flags target x86-64-v1. There is an exception, already in the current guidelines: packages that require specific hardware. So if there is code which doesn’t work at all on older CPUs, then it’s fine if it only supports those newer CPUs. What package do you have in mind? It would be easier to discuss with a specific example.

In the Proposal, we don’t touch the %{_dist} part, but just put the -v2 code in one of the subpackages. If a separate subpackage is used, it would be called something like <name>-x86-64-v2-<version>.fc40.x86_64.

Why do you think maintainers would start creating such deps? This proposal is about providing equivalent versions of packages so that people can pick the one that works the best on their hardware. Creating a pin on the version would be against the spirit of the Packaging Guidelines (E.g. Fedora Packaging Guidelines :: Fedora Docs says “Versioned dependencies (build-time or runtime) SHOULD ONLY be used when actually necessary to guarantee that the proper version of a package is present” and “Packages must not contain unnecessary explicit Requires on libraries”.) It also doesn’t really make much sense.

Such alternative approaches were already mentioned a few times in the mailing list thread (starting at Re: F40 Change Proposal: Optimized Binaries for the AMD64 Architecture (System-Wide) - devel - Fedora Mailing-Lists), based on special support from the kernel to substitute some magic strings in the paths. We don’t have such support in the kernel and it’s unlikely to be added. (And certainly not anytime quick.)

We could play with paths, but: a) we already have the glibc-hwcaps system which injects additional directories in the search path, so it seems reasonable to use the same approach for binaries, and b) creating such a magic directory would add quite a bit of complexity, at least in the package manager. We would need to move the actual binaries to a different place where rpm can manage them, and then create a symlink farm. I’m not a fan of such complicated approaches.

The proposed system will consist of a) a dozen lines of code in systemd and b) some relatively boring packaging work to compile some binaries a few times. We already do such repeated compilations e.g. for MPI packages (against mpich, openmpi, and possibly other implementations), and it’s not particularly pretty, but it amounts to putting a big loop around some part of the %build and %install sections.

Yeah, I don’t see any great benefit to this. It’s generally wrong to specify binaries by full path. $PATH is one of the great things in Linux and we should make more use of it, not less.