F42 Change Proposal: Optimized Binaries for the AMD64 / x86_64 Architecture (v2) (self-contained)

I found out about this through the Phoronix article, so I’m presumably not up to date on the various details, but on the face of it, I have some technical concerns.

If I’m reading this right, this relies on a few things:

  1. argv[0] containing the invoked name, in order for a symbolic link to resolve. This is unreliable and plain doesn’t work if another symlink is used. A hard link would be better, although it would have the same problem if the user creates a hard link.
  2. It is acceptable to the user/system to search a path, which could be a security problem.
  3. The overhead of a second exec() would be acceptable.

I can think of at least a few ways this could be done “better”.

  1. Make /usr/bin an overlay filesystem. The attributes of the running CPU are known at boot time! Upside: completely invisible to the user, should be fast. Downside: global change, and may need some awareness on the part of the package manager.
  2. Create a new binfmt “indirect executable” and handle the indirection in the kernel. This could be as simple as a text file with a magic number pointing to the paths of a set of executables and their requirements. Upside: should be fast. Downside: kernel support needed, could still cause hard link issues unless pathnames are absolute.
  3. Stub executable. An executable with the appropriate pathnames and requirements compiled in, which uses dlopen() to open the proper executable. Upside: most likely faster than a whole new exec(). Downside: requires PIE executables, could still have hard link issues unless pathnames are absolute.
  4. Fat executables. The executable is an archive of the actual executables, and ld.so would need to select the appropriate one. Upside: well understood technology; could be done entirely in user space if the fat executable has an appropriate ELF header. Could merge identical sections (e.g. data sections) with a smart enough linker. Downside: has been rejected before in Linux; significant development effort (support in the ELF interpreter, ld.so).
  5. Relying entirely on the system default $PATH. Upside: trivial to do. Downside: anything that explicitly sanitizes the PATH will most likely lose this information.
  6. Using filesystem extended attributes to point to alternative binaries. Pretty much a different implementation of #2.
  7. Add support using filesystem extended attributes to make it possible to do exec() on a directory. Upside: no link issues (directories cannot be hard linked); probably fast. Downside: Al Viro might go ballistic :wink: , might encourage application developers to do NeXT/Mac-style “application code and data directory” hacks.

To be honest, that last one (#7) is probably the solution I would prefer the most.

[hpa]:

If I’m reading this right, this relies on a few things:

  1. argv[0] containing the invoked name, in order for a symbolic link to resolve. This is unreliable and plain doesn’t work if another symlink is used. A hard link would be better, although it would have the same problem if the user creates a hard link.

[jreiser]:
This line of reasoning became obsolete 16 years ago. The Linux kernel
preserves in ElfXX_auxv.{AT_EXECFN} the character string of the first
parameter to execve(). For many years before that, the kernel had been
keeping an unadvertised copy near the highest user address.
I didn’t change the location, but did codify the usage:
$ git blame fs/binfmt_elf.c
651910874633a (John Reiser 2008-07-21 14:21:32 -0700 280) NEW_AUX_ENT(AT_EXECFN, bprm->exec);

Of course anything involving pathnames is subject to TOC-TOU races
(Time Of Check versus Time Of Use), but at least the race does have
a fair start.
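
For illustration, a minimal snippet contrasting the two sources of the invoked name (assuming Linux with glibc’s getauxval()):

    /* Compare argv[0] (whatever the parent passed to execve(), which can
     * lie) with AT_EXECFN (the kernel's own copy of the first execve()
     * parameter). Linux with glibc assumed. */
    #include <stdio.h>
    #include <sys/auxv.h>   /* getauxval, AT_EXECFN */

    int main(int argc, char *argv[])
    {
        const char *execfn = (const char *)getauxval(AT_EXECFN);
        printf("argv[0]:   %s\n", argc > 0 ? argv[0] : "(none)");
        printf("AT_EXECFN: %s\n", execfn ? execfn : "(unavailable)");
        return 0;
    }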

[hpa]:

  3. The overhead of a second exec() would be acceptable.

[jreiser]:
A second execve() is expensive and unnecessary. As I mentioned on devel@lists.fedoraproject.org:

The helper can just mmap() ld.so, prepend argv arguments such as
--preload, --library-path, and/or --argv0 according to the helper’s
diagnosis, and jump to ElfXX_Ehdr.e_entry of ld.so. ld.so already
must self-relocate (that’s one of the jobs of a PT_INTERP!),
so merely mmap()ing ld.so according to its PT_LOADs is enough.
The only execve() is the one that invoked the helper itself.
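
A rough sketch of that mapping step, under heavy simplification (x86_64 Linux and an ET_DYN ld.so assumed; error handling is minimal, the bss tail of each segment is not zeroed or extended, and building the initial stack that ld.so expects is omitted):

    /* Map the PT_LOAD segments of ld.so and compute its entry point. */
    #include <elf.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void *map_ldso(const char *path, Elf64_Addr *entry)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;

        Elf64_Ehdr eh;
        Elf64_Phdr ph[32];
        if (pread(fd, &eh, sizeof eh, 0) != sizeof eh || eh.e_phnum > 32)
            return NULL;
        pread(fd, ph, eh.e_phnum * sizeof ph[0], eh.e_phoff);

        /* Reserve one contiguous region spanning all PT_LOADs so the
         * segments keep their relative distances. */
        size_t span = 0;
        for (int i = 0; i < eh.e_phnum; i++)
            if (ph[i].p_type == PT_LOAD && ph[i].p_vaddr + ph[i].p_memsz > span)
                span = ph[i].p_vaddr + ph[i].p_memsz;

        uint8_t *base = mmap(NULL, span, PROT_NONE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return NULL;

        long page = sysconf(_SC_PAGESIZE);
        for (int i = 0; i < eh.e_phnum; i++) {
            if (ph[i].p_type != PT_LOAD)
                continue;
            /* p_vaddr and p_offset are congruent modulo the page size,
             * so both become page-aligned after subtracting the skew. */
            size_t skew = ph[i].p_vaddr & (page - 1);
            int prot = ((ph[i].p_flags & PF_R) ? PROT_READ  : 0)
                     | ((ph[i].p_flags & PF_W) ? PROT_WRITE : 0)
                     | ((ph[i].p_flags & PF_X) ? PROT_EXEC  : 0);
            mmap(base + ph[i].p_vaddr - skew, ph[i].p_filesz + skew, prot,
                 MAP_PRIVATE | MAP_FIXED, fd, ph[i].p_offset - skew);
        }
        close(fd);

        *entry = (Elf64_Addr)(base + eh.e_entry);  /* e_entry is base-relative */
        return base;
    }

From there the helper would push a fresh argc/argv/envp/auxv block in the System V ABI layout and jump to the computed entry point.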

[hpa]:

I can think of at least a few ways this could be done “better”.

  1. Make /usr/bin an overlay filesystem. The attributes of the running
    CPU are known at boot time!

[jreiser]:
Nope! Hint: a virtual machine container which migrates across different
CPUs at quiescent times.

[hpa]:

  3. Stub executable. An executable with the appropriate pathnames and
    requirements compiled in, which uses dlopen() to open the proper
    executable. Upside: most likely faster than a whole new exec().
    Downside: requires PIE executables, …

[jreiser]:
There is no such requirement if the helper can detect a conflict and
move itself out of the way. This is not too hard; at one time
Valgrind could do this. Also, there is a guaranteed safe location:
ending at one default thread stack size below the initial stack pointer.

[jreiser, again from devel@lists.fedoraproject.org]:

The helper should be ET_EXEC -static: no shared libraries,
and no PT_INTERP. For those who are paranoid about losing ASLR:
rebuild the helper every few hours using a small[-ish] set of
base addresses. Or, make the helper ET_DYN -PIE -static
(again: no shared libraries, no PT_INTERP) and [re-]pre-link it
every few hours.
… or take the minor hit from dynamically [self]-relocating the
PIE helper, which can be as small as several kilobytes with just
a handful of absolute addresses.

The evaluation of the different approaches and the tradeoffs between them will depend on the number and size of the executables. The general problem of “we have some binaries and we want to have different variants of them” is very broad. In Fedora, we’re actually looking at a corner case of that problem: we have thousands of binaries installed on the system, and only a few of them will be provided in multiple variants. In addition, our solution only has to work for packaged binaries, which are public and don’t do unexpected things.

We also don’t need to squeeze the maximum performance out of the dispatcher. We really don’t care about the case where somebody is spawning thousands of executables per second. If they are doing that, then there are much better ways to optimize than to make the individual binaries run a few percent faster.

To answer the specific concerns:

  1. argv[0] containing the invoked name, in order for a symbolic link to resolve. This is unreliable and plain doesn’t work if another symlink is used.

This is correct. By relying on the original symlink name, we place the limitation that a binary cannot be symlinked under a different name. I think this is an acceptable limitation.

Hardlinks are complicated for us. It is fine to have hardlinks within a single package (this is actually used heavily, e.g. to deduplicate identical Python bytecode files), but hardlinks for files in different packages cannot be handled natively by the package manager. We could implement the linking, for example via a scriptlet that would find all symlinks to the dispatcher binary and replace them with hardlinks. This is doable, but would be something of a complication. I would prefer to keep things simple in the beginning. We can always consider something like that later if it turns out that the symlinks cause problems. But my expectation is that there’ll be no issue.

  2. It is acceptable to the user/system to search a path, which could be a security problem.

I don’t see any security problem. In fact, everything is fully public, so anyone can get a list of the involved programs using repoquery.

  3. The overhead of a second exec() would be acceptable.

Yes. Early benchmarks indicate that it’s entirely acceptable. The proposed solution is quite simple. Some of the other solutions, while technically very interesting, are significantly more complex. In particular, any kind of dynamic dispatch without exec would in my opinion require much more initial work and would then probably uncover “interesting” corner cases and bugs. I’m sure that there are situations where those approaches have merit, but I think we’re better off with a simpler approach in our corner of the problem space.
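
For concreteness, a hypothetical sketch of such a dispatcher; the directory layout below is invented for illustration (it is not Fedora’s actual layout), and __builtin_cpu_supports() with microarchitecture-level names like "x86-64-v3" requires GCC 12 or newer:

    /* Derive the target name from AT_EXECFN (falling back to argv[0]),
     * pick the best directory for the running CPU, and execve() the
     * variant. All paths below are hypothetical. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/auxv.h>
    #include <unistd.h>

    extern char **environ;

    int main(int argc, char *argv[])
    {
        const char *execfn = (const char *)getauxval(AT_EXECFN);
        const char *invoked = execfn ? execfn : (argc > 0 ? argv[0] : NULL);
        if (!invoked)
            return 127;
        const char *slash = strrchr(invoked, '/');
        const char *name = slash ? slash + 1 : invoked;

        const char *dir = "/usr/lib64/hwcaps/x86-64-v1/bin";   /* baseline */
        if (__builtin_cpu_supports("x86-64-v4"))
            dir = "/usr/lib64/hwcaps/x86-64-v4/bin";
        else if (__builtin_cpu_supports("x86-64-v3"))
            dir = "/usr/lib64/hwcaps/x86-64-v3/bin";

        char path[4096];
        snprintf(path, sizeof path, "%s/%s", dir, name);
        execve(path, argv, environ);   /* the "second exec()" */
        perror(path);                  /* only reached if execve() failed */
        return 127;
    }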

  1. Make /usr/bin an overlay filesystem.

This has the disadvantage that it requires more upfront work. When the machine is booting, we’d need to check the CPU and pick a variant of every binary that has variants and assemble the full filesystem. And when the user adds or removes binaries, we’d need to somehow trigger a reload. This is certainly doable, but it’d be more of a solution for the case where we want to comprehensively handle all of the binaries installed on the system. The solution seems overkill for the case where there’s a small number of binaries and the user might not even use more than one or two.
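
As a sketch, the boot-time assembly could be a read-only overlayfs mount that stacks a CPU-specific directory over /usr/bin (directory names invented for illustration; requires root):

    /* Mount an overlay whose top layer holds the optimized binaries.
     * With no upperdir, the mount is inherently read-only. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* The first lowerdir is the top layer, so the v3 binaries
         * shadow the baseline ones in /usr/bin. */
        const char *opts =
            "lowerdir=/usr/lib64/hwcaps/x86-64-v3/bin:/usr/bin";
        if (mount("overlay", "/usr/bin", "overlay", MS_RDONLY, opts) != 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }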

  2. binfmt “indirect executable” and handle the indirection in the kernel
  3. Stub executable. dlopen(). PIE executables
  4. Fat executables
  7. Using filesystem extended attributes to make it possible to do exec() on a directory

Those all require significant development of new features… and have various downsides.

  5. Relying entirely on the system default $PATH

Yes. Actually the first version of this proposal used $PATH (Changes/Optimized Binaries for the AMD64 Architecture - Fedora Project Wiki). It was rejected for various reasons. The primary one is that we could end up with a dozen additional paths to search, in cases where there are many microarchitecture variants on a given architecture and the CPU supports them all. It’d also be quite visible to users. And changing $PATH dynamically is not possible, so the whole thing would be rather unwieldy.

  6. Using filesystem extended attributes to point to alternative binaries

This would work, but ATM I don’t think we actually need this. The plan is to put the alternative binaries in different directories, in particular because then they are not visible in the search path. We want to use a fixed naming convention to make this manageable, which means that we don’t actually need to specify the alternative binaries explicitly. If we wanted to specify them, I guess extended attributes would be the way to go.
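
If we did want explicit pointers, something like the following could read such an attribute (the attribute name is invented for illustration):

    /* Read a hypothetical "user.altbinary.x86-64-v3" extended attribute
     * naming an alternative binary (Linux, <sys/xattr.h>). */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/xattr.h>

    int main(void)
    {
        char buf[4096];
        ssize_t n = getxattr("/usr/bin/prog", "user.altbinary.x86-64-v3",
                             buf, sizeof buf - 1);
        if (n < 0) { perror("getxattr"); return 1; }
        buf[n] = '\0';
        printf("v3 variant: %s\n", buf);
        return 0;
    }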

This is certainly an interesting solution. I just don’t think the additional complexity is warranted. Bluntly: if I’m spawning an optimized bzip2 to compress a file, I really don’t care whether my 20-minute runtime is extended by 1 ms or not.

But if you can implement a helper that does such relocations and works better than the current approach, we can certainly consider switching to it in the future.


I think the idea that John Reiser has is basically my #3. Again, anything that depends on the install pathname of the file is going to have problems at the very least with hard links or overmounts; there is no “real” location for a hard link. For a symlink, realpath() will give its proper location.

And yes, it is true that ld.so as a PIE/PIC executable can be safely located outside the (0,2G] range that is used by medium model executables and most likely any large model executables if any ever are created.

The only reason I said PIE/PIC is that I was thinking that supporting loading of non-PIE images might need some work in ld.so.

One nice thing with #3 is that I believe it can be done without any modifications to ld.so whatsoever. Indeed, the problem is now exactly the same as loading shared libraries, because the program itself is now a shared library that just happens to include _start. It might not even be necessary for the stub executable to contain any actual code, depending on exactly how ld.so handles DT_INIT and DT_RUNPATH, although it might be desirable to have some code to mangle the auxiliary vector to point it to the “proper” executable.

It is also an option to have the stub executable be a non-libc (completely standalone, using raw syscalls) executable invoked before ld.so. It would then have to pick an executable and map ld.so and the chosen executable. However, with such a lightweight stub executable, it is probably more or less as fast to invoke sys_execve() and let the kernel do it all in one shot instead of requiring multiple sys_mmap() calls. This is really the #2 option done in user space.

Why not update rpm to decide which library or binary to install / copy?

Because the CPU can change, for example on a system installed on a removable drive.

How does changing CPUs (to one with reduced features) work today? Wouldn’t processes that have already dispatched to particular implementations based on runtime CPU feature detection all crash?

So, how about making it configurable? Let the user decide. The default is v1, and the user can set it to v2-v4. RPM, before installing anything, can detect whether the user set v4 but the CPU doesn’t support it.
Finally, if the user changes the config (upgrading from v1 to v3, or downgrading to v1), dnf upgrade / downgrade can update all related libraries and binaries.

Why would we want users to have to understand and actively configure CPU features in order to take advantage of them? Who’s going to actually do that? Surely not enough people to make the extra packaging work worthwhile. We might as well just stick with the status quo if we’re going to make people jump through hoops like that.


The Fedora devel list has been discussing this. It is not clear whether there is enough support in rpm/dnf to do this.

Good question, I do not know.

One way to reduce CPU features is to shut down a virtual machine container, move the container to a CPU with reduced features (perhaps by using a hypervisor configuration option on the same physical machine), then reboot the container.

One example of why reducing CPU features might be desirable: in 64-bit mode, AVX-512 provides 32 %zmm registers of 512 bits each, for a total of 16 Kbits. So 2 KiB of register state must be stored and loaded at each context switch for each process that is using AVX-512, which can be expensive if done at high frequency. Machines might want to cater to different classes of use according to a schedule, such as fast interactive response (low process state) for many users during a traditional work day, and more-efficient computation for fewer (but longer) CPU-bound tasks at night.
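
The per-process save area can be inspected with CPUID leaf 0xD; a small sketch (GCC or Clang on x86_64 assumed):

    /* Query the XSAVE area size the kernel saves/restores on context
     * switch: EBX = size for the currently enabled features, ECX = the
     * maximum size if every supported feature were enabled. */
    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid_count(0xD, 0, &eax, &ebx, &ecx, &edx))
            return 1;
        printf("XSAVE area (enabled features): %u bytes\n", ebx);
        printf("XSAVE area (all supported):    %u bytes\n", ecx);
        return 0;
    }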

RPM already knows how to do this. We just don’t expose it in DNF for the solver to select when downloading packages from a repository. We could easily choose to do this; openSUSE already does it this way with Zypper.

This change proposal has now been submitted to FESCo with ticket #3342 for voting.

To find out more, please visit our Changes Policy documentation.

Just to clarify: we can have “offline” hardware changes, where the system is stopped, hardware is replaced or the disk is moved to different hardware, and then restarted, and “online” hardware changes, where a running system is somehow frozen and the CPU is changed underneath it.

Let’s talk about the online case first. The case of a CPU changing underneath a running process can occur in special circumstances only, for example when a VM is snapshotted and the image is moved to a different machine and the execution is continued there. Another case would be if CRIU is used to move a process.

In general, CPU features cannot be changed at runtime. Programs test the CPU features when they are started, and there is no provision for those ever changing, so any change would create undefined behaviour.
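
A sketch of that pattern, probing once in a constructor and caching the result for the process lifetime (GCC builtins assumed):

    #include <stdbool.h>
    #include <stdio.h>

    static bool have_avx2;

    __attribute__((constructor))
    static void probe_cpu(void)
    {
        __builtin_cpu_init();   /* required before feature checks in a constructor */
        have_avx2 = __builtin_cpu_supports("avx2");
    }

    int main(void)
    {
        /* The cached flag is never re-checked, so if AVX2 vanished under
         * a live process, the AVX2 path would fault with SIGILL. */
        printf("dispatching to the %s code path\n",
               have_avx2 ? "AVX2" : "baseline");
        return 0;
    }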

The only correct way to move VMs or processes is to keep the CPU configuration unchanged. (In particular, in case of a VM this may mean that the CPU features exposed to the VM are not exactly what the host sees, but some standard subset, so that when the VM is moved to a similar but different physical CPU, the same standard subset is still supported, so the VM does not see a changed CPU.)

So to the question “how does (online) changing CPUs work?”, the general answer is “it doesn’t”.


Taking an image and restarting it on a different CPU, i.e. an “offline CPU change”, is a different story. Currently, all Fedora compiled code is required to support the baseline for a given architecture, so when the code is moved to different hardware, or the hardware is partially replaced, as long as this hardware meets the baseline, the code still works. When the process is started on the new hardware, it checks the features and DTRT.

In general, we nowadays install systems in a way where no information about the hardware is recorded on the system, everything is discovered dynamically. This means that a disk can be plugged into a different machine and things are expected to “just work”. (One common exception is initrd images, where we put only some subset of drivers in the host-only initrd to make boot quicker. But the main system would still have the full set of drivers.)

This is where the approach with dynamic dispatch is much nicer than the approaches with different rpm sets and the installer making a choice. In that second case we are “locked” into specific features and we cannot in general move the image to a different machine and expect it to work. This is another reason why I don’t think this approach should be pursued.


Okay, so why not revive the FatELF concept for this? It eliminates almost all of the complexity here around figuring out where and how to get binaries by simply just having the different subarches in the same executable file.

I would like to briefly caution against any approach that would rely on the assumption that x86_64-v(n+1) is a superset of x86_64-v(n). Due to Intel’s utter foolishness, it will likely be a long time before it’s possible to construct an x86_64-v5 that is both a superset of v4 and widely supported by (even new) CPUs.

Nothing suggested so far has that problem, but one such Bad Idea did occur to me while thinking about $PATH-based approaches.
