An alternate approach for optimized (executable) binaries

6 months ago, there was a change proposal for introducing (optional) optimized binaries into Fedora.

Although it ended up being removed and deferred to a later date, I decided to look into some of the feedback it received.

The mechanism suggested for optimized libraries seems simple enough, but the one for optimized executables drew a lot of concerns.

Among these concerns, the following were the most common:

  • Setting the $PATH is finicky and not guaranteed to work - programs can decide to modify it, which would break this mechanism.
  • This mechanism doesn’t work if the program is launched using its absolute path.

I’ve thought about a different mechanism, which solves these concerns, and would like some feedback on it!

Introduce a new executable, called hwcaps-loader. It first determines the CPU's maximum feature level, then executes the best available version of the specified program whose feature level is at or below the CPU's.

When a package opts-in to this mechanism, its executables in /usr/bin/ link to /usr/bin/hwcaps-loader, which finds the actual binary in /usr/bin/glibc-hwcaps/*/.

To keep storage costs as low as possible, there's no separate “loader” program for each executable. Instead, a hard link is made from /usr/bin/{EXEC_NAME} to /usr/bin/hwcaps-loader.

hwcaps-loader determines which program it should execute by reading the path of /proc/self/exe (a symlink to the process's binary file), which will correspond to the hard link.
Using this strategy, hwcaps-loader knows exactly which executable it has to run without needing to receive any launch parameters!

Here's a visualization of what the filesystem looks like:
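As a rough sketch, using a hypothetical package foo, the layout described above would look something like this:

/usr/bin/hwcaps-loader                 (the loader binary)
/usr/bin/foo                           (hard link to /usr/bin/hwcaps-loader)
/usr/bin/glibc-hwcaps/x86-64-v1/foo    (baseline binary)
/usr/bin/glibc-hwcaps/x86-64-v2/foo    (optimized binary)
/usr/bin/glibc-hwcaps/x86-64-v3/foo    (optimized binary)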

And here’s a logic flow of hwcaps-loader:
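In rough terms, the intended flow is:

  1. Read /proc/self/exe to find the hard link's path (e.g. /usr/bin/foo) and take its filename.
  2. Determine the highest µarch feature level the CPU supports.
  3. Pick the best matching binary under /usr/bin/glibc-hwcaps/<level>/<filename>, falling back to lower levels if needed.
  4. Execute that binary in place of the loader.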

Here are the main advantages of this mechanism:

  1. No reliance on $PATH: Optimized binaries can be executed simply by running the program’s usual path. Direct references still resolve to optimized binaries, and the command’s environment never interferes.
  2. Minimal storage footprint: No need to have one wrapper/loader for each optimized binary.
  3. Minimal overhead: A specialized binary launcher will take less time to launch the program compared to an equivalent shell script.

There's one major drawback, though… packaging must be very careful to ensure /usr/bin/hwcaps-loader is never deleted while other links to its inode still exist; otherwise, the inode and all the remaining links will linger.
Recreating /usr/bin/hwcaps-loader doesn't fix this - it simply creates a new inode which none of the existing links will use. A mechanism would be needed to ensure all references are removed before hwcaps-loader itself.

Using symlinks instead of hard links would solve this issue, but, unfortunately, /proc/self/exe points to the actual binary file rather than the symlink, so a hard link must be used.

I’ve made a Rust-written prototype of this concept. It cannot determine CPU capabilities yet, but all of the loading logic is implemented and works correctly.


You could use the command interpreter invocation mechanism. So /usr/bin/foo would contain this:

#!/usr/bin/hwcaps-loader
foo

And /usr/bin/hwcaps-loader would read the file and use the command from the second line. The existing #! handling in RPM might even generate the right dependencies for this.
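A minimal sketch of how the loader could recover the target name under this scheme, assuming the kernel passes the script's path as the first argument (standard #! behaviour); the function name is just an illustration:

use std::env;
use std::fs;

// Sketch: with `#!/usr/bin/hwcaps-loader`, the kernel invokes the loader with
// the script's path (e.g. /usr/bin/foo) as argv[1]; the target program name
// is then taken from the script's second line.
fn target_from_script() -> Option<String> {
    let script = env::args().nth(1)?;
    let contents = fs::read_to_string(script).ok()?;
    contents.lines().nth(1).map(|line| line.trim().to_string())
}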

Hi José,

I think this is a great proposal. After the previous approach flopped, we had some discussions about how to implement this, and we came to a similar conclusion. I would be happy to work with you on a Change Proposal for F41 to implement something like this.

Hard links are problematic. Or more precisely, hard links between different packages are problematic. (For example, we hardlink files within a single package if the content is identical; this happens a lot with Python pyc files. When those are in a single package, rpm supports this nicely.) The hard links would need to be created via scriptlets when the package is installed. I don't think this is worth it.

Instead, just check argv[0] and use the filename part of the path to determine what to launch. This is commonly done with multicall binaries, e.g. systemd-udevd is nowadays just a symlink to udevadm.
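A rough sketch of that dispatch (the hwcaps_dir parameter is just a placeholder here; the actual location is discussed below):

use std::env;
use std::path::{Path, PathBuf};

// Sketch: multicall-style dispatch. Only the filename component of argv[0]
// is used to pick the optimized binary to launch.
fn target_from_argv0(hwcaps_dir: &Path) -> Option<PathBuf> {
    let argv0 = env::args().next()?;               // may be absolute, relative, or a bare name
    let name = Path::new(&argv0).file_name()?.to_os_string();
    Some(hwcaps_dir.join(name))
}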

I’ve made a Rust-written prototype of this concept.

Can you post the code somewhere? How does this compare with a C program in size and execution overhead?

/usr/bin/glibc-hwcaps/

I’m not enthusiastic about including this in $PATH. It’ll mess with autocompletion. Why not /usr/libexec/glibc-hwcaps?

Or actually, just /usr/libexec/hwcaps? This is not tied to glibc in any way.

(For AMD64, /usr/bin/glibc-hwcaps/x86-64-v{2,3,4}/)

I don’t think the last path component is needed. Just put the µarch info in the file name suffix: glibc-hwcaps/foo.x86_64-v2, glibc-hwcaps/foo.x86_64-v4. This will make it easier to list the binaries, and also a bit easier to package them.

Actually, I think we can do better with this new approach. With the previous approach of extending $PATH, there was always going to be strong tension over the number of possible µarch variants: more variants means that $PATH gets longer, but OTOH, programs that have special architecture needs can find a better fit. With the loader-based approach, there isn't any particular limit on the number of variants, and we don't need to stick to the pre-defined µarch levels. So for example, let's say that a program works best when compiled with AVX2; we could then have something like glibc-hwcaps/foo.x86_64-v2+avx2, and the loader would split at + and launch one of the binaries that matches the requirements. (The details of the naming are still to be figured out.)
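A tiny sketch of what the split could look like (the naming is not final, so this is only an illustration):

// Sketch: split a variant suffix such as "x86_64-v2+avx2" into the base
// µarch level and any extra feature tags.
fn parse_variant(suffix: &str) -> (&str, Vec<&str>) {
    let mut parts = suffix.split('+');
    let level = parts.next().unwrap_or(suffix);
    let extras = parts.collect();
    (level, extras)
}

// parse_variant("x86_64-v2+avx2") yields ("x86_64-v2", ["avx2"]).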

My national exams make scheduling a bit tricky, but I’m definitely open to collaboration.

Indeed… I liked the hard link approach, but it definitely brings a lot of unnecessary complications. My one concern with the argv[0] approach is that it doesn't necessarily contain the program's absolute path - it can be a relative path, or just the command name as typed in the shell. That's why I preferred the hard link approach, as it simplifies figuring out which program was executed.

For example:

jrelvas@aquecedor-3-0:~$ ~/.var/app/org.gnome.Builder/cache/gnome-builder/projects/hwcaps_loader/builds/default-host-x86_64-main/src/test-program
The first argument is /home/jrelvas/.var/app/org.gnome.Builder/cache/gnome-builder/projects/hwcaps_loader/builds/default-host-x86_64-main/src/test-program
jrelvas@aquecedor-3-0:~$ PATH=~/.var/app/org.gnome.Builder/cache/gnome-builder/projects/hwcaps_loader/builds/default-host-x86_64-main/src test-program
The first argument is test-program
jrelvas@aquecedor-3-0:~$ cd ~/.var/app/org.gnome.Builder/cache/gnome-builder/projects/hwcaps_loader/builds/default-host-x86_64-main/src && ./test-program
The first argument is ./test-program

In the first case, this approach works correctly because argv[0] is an absolute path. The third case is a bit different because it's a relative path, but it's possible to infer the absolute path by checking the working directory.

…But the second situation is much harder to deal with, as we only get an executable name with no path to guide us. The only workaround I can think of is to simply assume that such a binary resides in /usr/bin…
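A small sketch of how those three cases could be handled, with the bare-name case falling back to the /usr/bin assumption mentioned above:

use std::env;
use std::path::{Path, PathBuf};

// Sketch: an absolute argv[0] is used as-is, a relative path is resolved
// against the working directory, and a bare name is assumed to live in /usr/bin.
fn resolve_argv0(argv0: &str) -> std::io::Result<PathBuf> {
    let path = Path::new(argv0);
    if path.is_absolute() {
        Ok(path.to_path_buf())
    } else if argv0.contains('/') {
        Ok(env::current_dir()?.join(path))
    } else {
        Ok(Path::new("/usr/bin").join(path))
    }
}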

I’ve heard of some people suggesting the use of an Overlay FS, but I’m not quite sure how that would work.

Sure! I’ve not made a repo since it’s not really complete yet (there’s no logic for actually choosing which feature level to execute yet, just the “finding the target program” part), but here’s the code:

use std::env;
use std::process;

fn main() {
    // env::current_exe() reads /proc/self/exe internally... which doesn't point to symlinks.
    // Hard links must be used instead.
    let exe_path = match env::current_exe() {
        Ok(p) => p,
        Err(e) => {
            eprintln!("failed to get current exe path: {e}");
            process::exit(-100);
        },
    };

    let file_name = match exe_path.file_name() {
        Some(n) => match n.to_str() {
            Some("hwcaps_loader") => {
                eprintln!("hwcaps_loader can't load itself");
                process::exit(-100)
            },
            _ => n,
        },
        _ => {
            eprintln!("No filename found");
            process::exit(-100)
        }
    };

    let hwcaps_path = match exe_path.parent() {
        Some(n) => n.join("glibc-hwcaps"),
        None => {
            eprintln!("No parent found");
            process::exit(-100)
        }
    };

    // Feature level detection isn't implemented yet, so the baseline directory is
    // hardcoded. (Launch arguments are not forwarded in this prototype.)
    let status = process::Command::new(hwcaps_path.join("x86-64-v1").join(file_name))
        .status()
        .expect("failed to execute process");

    // Propagate the child's exit code.
    if let Some(c) = status.code() {
        process::exit(c);
    }
}

I didn't actually make a C equivalent of this. There was no particular reason for it to be in Rust; it's just my preferred language. I definitely don't mind using C instead, though.

Sure, that sounds reasonable! There’s no particular reason for putting the files in /usr/bin directly.

Agreed.

This could be a solid idea, actually. There is a surprisingly large number of devices which support AVX but not AVX2. (Therefore, they're only x86_64-v2 instead of x86_64-v3.)

Oh yes, the argv[0] approach might be problematic if the program is invoked through a symbolic link under a different name! In that case, hwcaps_loader would try to find optimized versions of programs that don't actually exist. I'm not sure how to solve this issue besides adding matching links in /usr/libexec/hwcaps.


Maybe this is something which could be solved at the kernel level? The current issue is that /proc/self/exe points to the real program instead of its symlink; maybe kernel devs could add a different interface which points to the symlink that was originally executed?

Yes, but I don't think this is an issue for the intended use case. Our programs reside in one directory (/usr/libexec/hwcaps or whatever we decide on in the end), so we don't actually care about the original call path. Only the filename matters, and we get that in all cases.

(The user can symlink the program under a different name or just specify a fake argv[0]. But that’s OK, if they want to, they can do that.)

When I build this Rust program with --release, it's 472 kB. Not too much, but I expect that the C equivalent would be much, much smaller. I like Rust, but I think it makes sense to write a C version too, to compare.

[profile.release]
strip = true
opt-level = "s"
lto = true
codegen-units = 1
panic = "abort"

There might be a more up-to-date version of this magic recipe, but it should reduce the binary size quite a bit.

I get 328 kB with that.

If the following points are met, I think the symlink approach should be okay:

  • Symlinks in /usr/bin/ are not allowed for packages using this mechanism. The following topology should be used instead:
    • /usr/bin/alias (symlink) → /usr/bin/hwcaps_loader
    • /usr/libexec/hwcaps/alias.* (symlinks) → /usr/libexec/hwcaps/actual_binary.*
  • It's acceptable for symlinks created by the user (or by an unpackaged program/misbehaving package) pointing to any binary that uses a hwcaps symlink in /usr/bin/ to always fail.
  • It’s acceptable for all binaries using a hwcaps symlink in /usr/bin/ to always fail when argv[0] is an unexpected value.

After giving this some more thought…

All binary feature levels (including the baseline, even for packages which provide no optimizations) are placed in separate directories under /usr/libexec/hwcaps/ (or maybe an entirely different path in /usr?):

/usr/libexec/hwcaps/x86_64-v1
/usr/libexec/hwcaps/x86_64-v2
/usr/libexec/hwcaps/x86_64-v2+AVX
/usr/libexec/hwcaps/x86_64-v3

Once we're leaving the initrd (or an equivalent bootstrapping environment), systemd determines the highest feature level supported by the machine.

It then mounts an OverlayFS over /usr/bin, with the bottom layer being the lowest feature level available and the top layer being the maximum level supported.

This should make all binaries accessible in /usr/bin! Higher feature levels are preferred, with a fallback for binaries with only lower levels!


This approach might be the best of both worlds:

  • Doesn’t have edge cases where the behavior changes, unlike the symlink approach.
  • Much easier to package properly, unlike the hardlink approach.

This is certainly an interesting approach. The overlay could even use a few layers, combining a few different dirs.

The problem is how this would interact with rpm and updates. In particular, rpm -V would report errors, and if the upgrade tries to replace the file in /usr/bin, it’d get confused and probably try to do something to one of the files in the overlay.

But maybe this can be made to work.

Here's an example showing what the topology could look like:

/usr:

OverlayFS:

I think this is plausible if RPM and DNF are updated to have a concept of feature levels. The package's feature level would determine which directory under /usr/hwcaps rpm/dnf installs the files into.

foo.x86_64 → /usr/hwcaps/x86_64-v1 + /usr/lib64
foo.x86_64.v2 → /usr/hwcaps/x86_64-v2 + /usr/lib64/glibc-hwcaps
foo.x86_64.v3 → /usr/hwcaps/x86_64-v3 + /usr/lib64/glibc-hwcaps
etc.

We could probably handle all of the noarch packages which want to place files in /usr/bin or /usr/libexec by directing them to /usr/hwcaps/noarch.

As a bonus, this would allow for more flexibility with installs. A sysadmin could decide to not keep “fallback” versions of packages installed in order to save on storage, for example.

In short:

New paths added to topology (not set in stone):

  • /usr/hwcaps
  • /usr/hwcaps/{arch_name}{feature_level}/bin
  • /usr/hwcaps/{arch_name}{feature_level}/libexec
  • /usr/hwcaps/noarch/bin
  • /usr/hwcaps/noarch/libexec
  • /usr/lib64/glibc-hwcaps

RPM/DNF

Add a “feature level” to RPM binaries. If no feature level is specified, assume the level is that architecture’s lowest/baseline level. Multiple feature levels of the same architecture can be installed at the same time.

All architectures (except i686):

  • /usr/bin/ and /usr/libexec/ are redirected to /usr/hwcaps/{arch_name}{feature_level}.
  • If the feature level is the “baseline” one, /usr/lib64/ remains unchanged.
  • Otherwise, /usr/lib64/ is redirected to /usr/lib64/glibc-hwcaps/{arch_name}{feature_level}.
  • /usr/lib/ and all other paths remain unchanged.

i686:

  • /usr/bin/ and /usr/libexec/ are redirected to /usr/hwcaps/x86-64-v1.
  • /usr/lib/ and all other paths remain unchanged.

(Note: Maybe special-casing i686 isn’t necessary?)

noarch:

  • /usr/bin/ and /usr/libexec/ are redirected to /usr/hwcaps/noarch.
  • All other paths remain unchanged.

Booting process

The initrd remains unchanged.
Differences in behavior start when systemd mounts the root path. Systemd determines which architecture levels are supported by the system, then mounts an OverlayFS for both /usr/bin and /usr/libexec. The best feature levels are the top-most layers:

/usr/hwcaps/noarch/bin    | 
/usr/hwcaps/x86_64-v1/bin |
/usr/hwcaps/x86_64-v2/bin |
/usr/hwcaps/x86_64-v3/bin |
/usr/hwcaps/x86_64-vN/bin |
             Mount order  V
/usr/bin 

Building initrd/UKI

No major changes should be required - generally, just make sure you’re adding the binaries and libraries with the lowest feature level when generating the images.

Unanswered questions

  • The fate of i686/multiarch:
    We need an elegant way of dealing with multiple installs. It's probably a good idea to not allow mixing binaries of different architectures. (foo.i686 and foo.x86_64.v3 cannot be installed at the same time if they both provide binaries.)
  • Handling situations where no suitable binary is available:
    Is this something to worry about? Should there be some kind of “fallback error program”, or is it fine to simply let the binaries go missing?

I don’t think that prerequisite is feasible. If we require rpm and dnf5 to be updated, we can schedule this feature for alpha testing in F43. :wink: And there’d need to be some code in systemd too, and maybe container managers (you want to run optimized code in containers too, why not?).

The original proposal(s) with a loader binary are small and localized and can be implemented in a few days. Heh, you even wrote part of the code already.

Currently we don’t even have a list of executables which would benefit from this scheme. Maybe it’ll be just a handful. I don’t think it makes much sense to plan a grand scheme before we know that this is even going to be used. If it turns out that there’s massive demand, we can consider implementing the big scheme.

Sure. A loader-based approach is a nice way to experiment and try to measure the benefits of optimized binaries for a select set of packages.

Since the experiment would be limited to specific packages, it'd also be much easier to keep the symlink approach's limitations under control.

It’s probably worth considering codifying the feature levels into the package management itself long-term, though.

Here are some candidates which I believe might benefit from optimized binaries/libraries:

  • Mesa
  • Firefox
  • MPV (I've actually tested this one - an x86-64-v3 build reduced CPU package power draw by around 0.10 W, netting a few extra minutes of video playback.)
  • Core system components like graphical shells, compositors, and Pipewire.
  • Generally stuff which is pretty math-heavy or multimedia-intensive.

Performance isn’t necessarily the only win - it looks like these optimizations can also reduce power consumption, which might be a big deal for mobile systems. We should probably benchmark for that too, along with regular performance metrics.

I’m going to make a newer prototype of hwcaps_loader which implements some of the things we’ve discussed.

@zbyszek

Here’s my suggestion for allowing optimized binaries in /usr/libexec to be handled by hwcaps_loader:

Besides /usr/bin/hwcaps_loader, we could also have /usr/libexec/hwcaps_loader as a hard link.

Eligible binaries in /usr/bin/ symlink to the former, eligible binaries in /usr/libexec/ symlink to the latter.

The target binary's name is determined through argv[0] as usual, but, to determine whether it's in bin or libexec, we check the parent folder of /proc/self/exe, which should be /usr/bin or /usr/libexec depending on which hard link was invoked.
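A quick sketch of that check (just an illustration of the idea, not final code):

use std::env;
use std::path::{Path, PathBuf};

// Sketch: /proc/self/exe resolves to whichever hard link was invoked
// (/usr/bin/hwcaps_loader or /usr/libexec/hwcaps_loader), so its parent
// directory tells us where the target binary is expected to live.
fn loader_dir() -> std::io::Result<PathBuf> {
    let exe = env::current_exe()?; // reads /proc/self/exe
    Ok(exe.parent().unwrap_or(Path::new("/")).to_path_buf())
}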

What do you think?


Edit: Actually, this might not be necessary, assuming the argv[0] of binaries in /usr/libexec is always either absolute or relative to the working directory… I think.

Hmm, but what's the purpose of this complication? This would only be useful if there were a binary with the same name in both /usr/bin and /usr/libexec. This doesn't seem useful or needed. The idea of having binaries with the same name in different places, with different behaviour or capabilities, was used in the past, but it's not done any more. I'm pretty sure we shouldn't add any new cases.

I was able to greatly reduce the binary's size by dropping Rust's libstd with #![no_std] and interfacing with libc directly.

It does make the code somewhat more verbose, but this shouldn’t be a huge deal.

// Excerpt from the #![no_std] prototype. readlink and __errno_location are libc
// bindings; MAX_PATH_LEN, memrchr and Vec (from alloc) are defined elsewhere in the crate.
fn handle_path_aliases(program_name: &[u8]) -> &[u8] {
    let mut parent_dir = vec![0u8; MAX_PATH_LEN as usize];

    // readlink(2) does not NUL-terminate; it returns the number of bytes written.
    let size = unsafe {
        readlink(
            b"/proc/self/exe\0".as_ptr(),
            parent_dir.as_mut_ptr(),
            MAX_PATH_LEN as usize,
        )
    };

    if size < 0 {
        panic!("Failed to read \"/proc/self/exe\" link! ERRNO: {}", unsafe { *__errno_location() })
    }

    parent_dir.truncate(size as usize);
    let mut parent_slice = parent_dir.as_slice();

    // Keep everything up to and including the last slash (the parent directory).
    let last_slash = memrchr(b'/', parent_slice)
        .expect("No slash (/) in filepath. This should be impossible??");

    parent_slice = parent_slice.get(0..last_slash + 1).expect("parent_slice out of range!");

    let mut resolved_path: Vec<u8> = Vec::new();
    resolved_path
        .try_reserve(parent_slice.len() + program_name.len())
        .expect("Failed to reserve vector");

    resolved_path.extend_from_slice(parent_slice);
    resolved_path.extend_from_slice(program_name);

    // Leak the buffer: the resulting path must live until the program is executed.
    resolved_path.leak()
}

I’ve finished a new prototype for hwcaps-loader.

Here are the key changes:

  • Removed the dependency on Rust's stdlib. This should save a lot of space and make the binary's size more or less equal to that of a C equivalent. :wink:
  • The implementation is now based on symlink/argv[0] detection. It's not 100% perfect… but it should be good enough for our purposes.

For now, there’s no logic in place for actually determining which feature level should be used or for executing the program. The latter shouldn’t be complicated - I’ll probably be able to get that mechanism working in the coming days, but the former needs more investigation.
I think glibc provides an API for detecting the feature levels at runtime… so the mechanism is probably going to use that as a basis.
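For illustration only (this is not the glibc API mentioned above, which would be the preferred basis): in std-based Rust, a rough approximation could use runtime feature detection, checking a simplified subset of the x86-64-v2/v3 feature sets. The #![no_std] prototype would instead need to query CPUID (or glibc) directly.

// Sketch: pick a µarch level from a simplified subset of the feature checks.
// Only valid on x86/x86_64 targets; the real level definitions include more
// features than shown here.
fn detect_level() -> &'static str {
    if std::arch::is_x86_feature_detected!("avx2")
        && std::arch::is_x86_feature_detected!("fma")
        && std::arch::is_x86_feature_detected!("bmi2")
    {
        "x86-64-v3"
    } else if std::arch::is_x86_feature_detected!("sse4.2")
        && std::arch::is_x86_feature_detected!("popcnt")
    {
        "x86-64-v2"
    } else {
        "x86-64-v1"
    }
}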

Here’s the source code:


Edit: Wow, turns out doing the execution part was much easier than I expected. I've added that too now. :melting_face: