Verifying the authenticity of files uploaded to the lookaside cache

Hello everyone,

I am investigating the verification of source code integrity after software is packaged and included in Fedora’s repositories. In the course of this research, I found a potential issue that is worth discussing further.

Summary

No automated checks ensure that the files referenced in remote Source: URLs are the same as the files actually used by Koji at build time. Since no such check occurs, a wrong version or a maliciously modified source code archive could be built and distributed in place of upstream’s source code archive. This is possible even if a packager is acting in good faith, and could result in distribution of malware via the official package repositories.

Analysis of all 35,665 remote files referenced in Fedora’s .spec files reveals that about 4.5% of remote files do not match what is in the lookaside cache and about 4.9% were unreachable. Looking specifically at files that did not match, were Source0, and were an archive (1,012 files, 2.8% of all files), only 67.1% showed no differences in archive content. This leaves 333 files that either had differences in content or whose source URLs did not return an archive.

Given this data, it is worth investing engineering time to try and improve the situation. For example, a pre- or post-build step could be introduced to verify the integrity of files used during the build.

Problem

The files that package maintainers upload to dist-git do not have to match the original file published by upstream. We do not seem to check hashes for package source code archives outside of the package review process. While there are edge cases where an upstream’s file must be modified or is unversioned, the majority of source files should match their upstream counterparts.

A packager could upload a vulnerable older version of source code or maliciously modified source code. This could cause Fedora to distribute malware. There are multiple possible ways this could happen:

  • Packager accidentally uploads the wrong version from upstream
  • Attacker takes over an abandoned upstream’s website and hosts backdoored source archives. Packager does not notice and packages using the compromised upstream URL.
  • Packager maliciously modifies source code archive before uploading to dist-git
    • Attacker could also compromise a packager’s FAS credentials. 2FA isn’t mandatory for packagers.
  • Attacker performs a man-in-the-middle attack while a packager is working on a package using insecure protocols.
    • 15% of remote files use http: or ftp:

Data and Analysis

Comparing Source and Lookaside Cache Files

To explore this, I used spectool to extract the SourceX: URLs from every single .spec included in rpm-specs-*.tar.xz from here. Then, I hashed every remote file and compared them to their counterpart on the lookaside cache. The results are below:

Status Definitions

| Status | Definition | Notes |
|---|---|---|
| Match | SHA256 hash of the source file matched the file on the lookaside cache. | |
| Match, >1 version | SHA256 hash of the source file matched at least one version of the file on the lookaside cache. | This makes sense for unversioned files, but indicates a mistake if observed for a versioned file. |
| Source unreachable | The source URL was unreachable. | May occur due to source URL being taken down, networking issues, typos in the spec file, or bugs in my script. |
| Lookaside unreachable | Could not find a file on the lookaside cache. | May occur because packager has not uploaded a file yet, bugs in the hashing script, or the file is stored in git instead. |
| Mismatch | SHA256 hash of the source file did not match any files found on the lookaside cache. | May occur due to source URL being replaced with another page/file, bugs in my script, hitting bot detection/firewall, or the file being unversioned and the source changing. |
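
The status definitions above can be read as a small decision procedure. The following is a minimal Python sketch of that reading, not necessarily how the script used for this analysis works. The names `sha256_hex` and `classify` and the argument shapes are hypothetical, and treating "Match, >1 version" as "several versions are cached and the source matched one of them" is an assumption.

```python
import hashlib
from typing import Optional


def sha256_hex(data: bytes) -> str:
    """Return the SHA256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def classify(source: Optional[bytes], lookaside: list[bytes]) -> str:
    """Map one remote file and its cached versions to a status.

    `source` is None when the source URL could not be fetched;
    `lookaside` is empty when no file was found on the cache.
    (Hypothetical interface, assumed for illustration.)
    """
    if source is None:
        return "Source unreachable"
    if not lookaside:
        return "Lookaside unreachable"
    cached = {sha256_hex(blob) for blob in lookaside}
    if sha256_hex(source) in cached:
        # Assumption: ">1 version" means several distinct versions are cached.
        return "Match" if len(cached) == 1 else "Match, >1 version"
    return "Mismatch"
```

A real run would also need to record why a fetch failed, since the notes column lists several distinct causes for the unreachable and mismatch statuses.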

Analyzing a Subset of Mismatches

Although a 4.5% mismatch rate among remote files seems bad, further analysis of the files that had a hash mismatch, were Source0, and were an archive shows better numbers:

| Diff result | File Count |
|---|---|
| No difference in content[1][2] | 679 |
| File returned was HTML[3] | 101 |
| File was archive with differences between files present in both archives[4] | 127 |
| Other/Uncategorized | 105 |
| Total Files | 1,012 |
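
To illustrate how two archives can have different hashes yet "no difference in content", here is a minimal Python sketch that compares archives by the contents of their member files rather than by the archive bytes. It is an assumed approach for illustration, not necessarily the diff method used for the table above; `content_digests` and `make_tar_gz` are hypothetical helper names, and the sample file is made up.

```python
import hashlib
import io
import tarfile


def content_digests(archive: bytes) -> dict[str, str]:
    """Map each regular file in a tar archive to the SHA256 of its contents.

    Timestamps, ownership, and compression settings are ignored on purpose.
    """
    digests = {}
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:*") as tar:
        for member in tar.getmembers():
            if member.isfile():
                data = tar.extractfile(member).read()
                digests[member.name] = hashlib.sha256(data).hexdigest()
    return digests


def make_tar_gz(files: dict[str, bytes], mtime: int, level: int) -> bytes:
    """Build a .tar.gz in memory (helper for this demonstration only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz", compresslevel=level) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


files = {"pkg-1.0/README": b"hello\n"}
upstream = make_tar_gz(files, mtime=1_700_000_000, level=9)
repacked = make_tar_gz(files, mtime=1_710_000_000, level=6)

# The archive bytes differ because the member timestamps differ ...
print(hashlib.sha256(upstream).hexdigest() == hashlib.sha256(repacked).hexdigest())
# ... while the per-file content digests are identical.
print(content_digests(upstream) == content_digests(repacked))
```

A comparison like this only says the file contents agree; it deliberately ignores metadata such as permissions, so it is a triage aid rather than a proof that two archives are interchangeable.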

Real-World Testing

I added a newline to a file in one of my packages before releasing an update. As you can see in Bodhi, all checks passed and the update is being distributed.

SHA512 (lookaside-numix-icon-theme-24.04.22.tar.gz) = c59440f5fb1d4c4c54d863d9815ae8da0976dab92e079b206025f372664995a38c841618173ffc38871f2c1766ae3bc32e143195fcd71c34d554a1f1f589fe70
SHA512 (source-numix-icon-theme-24.04.22.tar.gz)    = d7bee3436c901e1f43b74a0e1312e4d40ce5ebc0b8a681d76c9fc3135c581b9a19252678edeaebece5dbb65afc011cf40dac52e24751f2ee273fd5338dfa8acf

Insecure Protocols

14% of files are fetched using http: and 0.9% use ftp:. As mentioned before, this could be problematic due to man-in-the-middle attacks.
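
A breakdown like this can be computed from the extracted Source URLs. The following minimal Python sketch shows one way to do it; the function names are hypothetical and the sample URLs are placeholders, not entries from any real spec file.

```python
from collections import Counter
from urllib.parse import urlsplit


def scheme_counts(urls: list[str]) -> Counter:
    """Count URL schemes (http, https, ftp, ...) in a list of source URLs."""
    return Counter(urlsplit(url).scheme.lower() for url in urls)


def insecure(urls: list[str]) -> list[str]:
    """Return the URLs that use a protocol without transport security."""
    return [u for u in urls if urlsplit(u).scheme.lower() in {"http", "ftp"}]


# Placeholder URLs, not taken from any real spec file.
sample = [
    "https://example.org/foo-1.0.tar.gz",
    "http://example.org/bar-2.0.tar.gz",
    "ftp://example.org/pub/baz-3.0.tar.xz",
]
print(scheme_counts(sample))
print(insecure(sample))
```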

Possible Solution

In the short term, a waivable check could be created that compares the hashes of remote files against the files in a package’s .src.rpm.

In the long term, the buildsystem should handle fetching sources for packagers. Outside of edge cases that prevent a file from being retrieved automatically, packagers shouldn’t need to interface with the lookaside cache. There’s a long-standing issue in RPM that I’ve linked below discussing this.

References

Corrections

Lastly, please call out any mistakes or bugs. I’m sure that my data isn’t perfect, but it should be a good starting point for a discussion.


  1. This likely means that the date or compression settings were different from upstream’s published archive, i.e. the packager repacked the archive.

  2. 514 of these mismatches with no difference in content were rust-* packages. I am not sure why.

  3. This presents an issue as a hostile party could take over a website and impersonate upstream or trick someone who is unfamiliar with a given package (e.g. proven packager) to download modified sources. If you look through the data you can find examples of upstream websites that have been taken over by gambling sites.

  4. Different version and/or source code was manually modified by packager. Includes false positives where packagers use main instead of the same ref that upstream releases the archive with.


This seems weird. We download .crate files from crates.io and don’t modify them. And as far as I know, they’re uploaded once when publishing and never modified afterwards …

It’s possible that crates.io changed how they compress archives at some point and the packages haven’t received updates since then. I haven’t looked too deeply into it since no files inside the archive differ.


> Hello everyone,
>
> I am investigating the verification of source code integrity after software is packaged and included in Fedora’s repositories. In the course of this research, I found a potential issue that is worth discussing further.

Ah… memories.

I used to run a hacked up shell script like this:
https://www.scrye.com/~kevin/fedora/sourcecheck/

> Given this data, it is worth investing engineering time to try and improve the situation. For example, a pre- or post-build step could be introduced to verify the integrity of files used during the build.

The problem there is that it then makes us depend on external internet.
I mean, I support a check, but I don’t think it should block/be required
in most cases.

Perhaps it could be something we could add to our existing CI setups?

…snip…

> Insecure Protocols
>
> 14% of files are fetched using http: and 0.9% use ftp:. As mentioned before, this could be problematic due to man-in-the-middle attacks.

Yeah, although for fedora builds those are not used, and maintainers may
well not use them either. But would be good to clean them up…

> Possible Solution
>
> In the short term, a waivable check could be created that compares the hashes of remote files against the files in a package’s .src.rpm.

Yeah, that seems like a good step.

Trying to refine the list and get maintainers to fix things also seems
like a good idea.

> In the long term, the buildsystem should handle fetching sources for packagers. Outside of edge cases that prevent a file from being retrieved automatically, packagers shouldn’t need to interface with the lookaside cache. There’s a long-standing issue in RPM that I’ve linked below discussing this.

That complicates things a lot, but could perhaps be done in lots of
common cases.

Thanks for working on this!


No, that shouldn’t have happened. As far as I know, .crate files have always just been renamed .tar.gz files. Those files are created locally before uploading, and I’m pretty sure no server-side processing should have happened to them.

This is really strange. It seems like the hashing difference is due to some of the crate files not using gzip compression.

Looking at cargo-0.79.0.crate:

| Source Package | File | SHA256 Lookaside | SHA256 Source |
|---|---|---|---|
| rust-cargo | cargo-0.79.0.crate | 9057c4167c5ae39b926ba6c77a1becd55785f5170e7b10081b566eacf235dab5 | e89822dd6c8eb196a1c5e73ebc063ba8bcbb7970ed8d7933d8431923adeca295 |

The lookaside file does not seem to be gzipped, but in rust-cargo-0.79.0-1.fc41.src.rpm it has the correct hash.

https://koji.fedoraproject.org/koji/fileinfo?rpmID=38651972&filename=cargo-0.79.0.crate

So perhaps this has to do with accessing the file through https://src.fedoraproject.org/repo/pkgs/* instead of how koji gets the file to build rust-cargo-0.79.0*.src.rpm?

$ wget https://src.fedoraproject.org/repo/pkgs/rust-cargo/cargo-0.79.0.crate/sha512/93867f371a57349b2107b99519cc750f52bbe04896165d5b67ffd3ea243aefec80036640c07652e8d7b1d2037dc6a1bc05d0d0ebadbb212e4cc9628e0e0c90ac/cargo-0.79.0.crate && sha256sum cargo-0.79.0.crate
cargo-0.79.0.crate   100% [=====================================================================================================>]    2.41M   12.51MB/s
                          [Files: 1  Bytes: 2.41M [4.28MB/s] Redirects: 0  Todo: 0  Errors: 0                                    ]
9057c4167c5ae39b926ba6c77a1becd55785f5170e7b10081b566eacf235dab5  cargo-0.79.0.crate
$ file cargo-0.79.0.crate
cargo-0.79.0.crate: POSIX tar archive (GNU)
$ wget -O cargo-0.79.0.crate https://crates.io/api/v1/crates/cargo/0.79.0/download && sha256sum cargo-0.79.0.crate                     
cargo-0.79.0.crate   100% [=====================================================================================================>]    2.41M   23.39MB/s
                          [Files: 1  Bytes: 2.41M [4.01MB/s] Redirects: 1  Todo: 0  Errors: 0                                    ]
e89822dd6c8eb196a1c5e73ebc063ba8bcbb7970ed8d7933d8431923adeca295  cargo-0.79.0.crate
$ file cargo-0.79.0.crate                                                                                                              
cargo-0.79.0.crate: gzip compressed data, was "cargo-0.79.0.crate", max compression, original size modulo 2^32 11721216

I didn’t know that you had this! Following in the steps of the great :smile:

I’d be happy to write a check for this. I am not super familiar with the existing CI; although I was able to find some docs for testing farm, would you know where I could find the CI job definitions?

Just to be extra clear here: I am worried about MITM between the source and packager introducing poisoned sources to lookaside. dist-git ↔ koji should be good to my knowledge.

Reaching out to maintainers sounds good! Let me know the best way to help with that; I am unsure how best to move forward.

> I didn’t know that you had this! Following in the steps of the great :smile:

Great minds think alike? Or something. :wink:

> I’d be happy to write a check for this. I am not super familiar with the existing CI; although I was able to find some docs for testing farm, would you know where I could find the CI job definitions?

I’m not really up on CI setup config. You might try asking in
#fedora-ci:fedoraproject.org matrix channel?

> Just to be extra clear here: I am worried about MITM between the source and packager introducing poisoned sources to lookaside. dist-git ↔ koji should be good to my knowledge.

yeah, but just because a spec has http:// or ftp:// in the Source line,
the maintainer could (and hopefully should) be getting the source via a
better means. They could get it from a https link, or have some other
way to download it/get it. We just don’t know. ;(

Of course fixing that in the spec to a valid https link would help
anyone who uses ‘spectool -g’ or the like too.

…snip…

> Reaching out to maintainers sounds good! Let me know the best way to help with that; I am unsure how best to move forward.

Posting to the devel mailing list might help, many maintainers are
there, but don’t follow things here yet too much sadly. ;(


Hm, maybe the way you downloaded the file gets content-type application/gzip or compression: gz in an HTTP header, and your download method decompressed the file on-the-fly?

Yep, it’s because the web server is being smart and telling my http client that the file has been gzipped. Adding a header to not use compression fixes it.

$ wget --header "Accept-Encoding: identity" -O cargo-0.79.0.crate https://crates.io/api/v1/crates/cargo/0.79.0/download && sha256sum cargo-0.79.0.crate
cargo-0.79.0.crate   100% [===============================================================================>]    2.41M    7.23MB/s
                          [Files: 1  Bytes: 2.41M [1.99MB/s] Redirects: 1  Todo: 0  Errors: 0              ]
e89822dd6c8eb196a1c5e73ebc063ba8bcbb7970ed8d7933d8431923adeca295  cargo-0.79.0.crate
$ wget --header "Accept-Encoding: identity" https://src.fedoraproject.org/repo/pkgs/rust-cargo/cargo-0.79.0.crate/sha512/93867f371a57349b2107b99519cc750f52bbe04896165d5b67ffd3ea243aefec80036640c07652e8d7b1d2037dc6a1bc05d0d0ebadbb212e4cc9628e0e0c90ac/cargo-0.79.0.crate && sha256sum cargo-0.79.0.crate
cargo-0.79.0.crate.2 100% [===============================================================================>]    2.41M   12.52MB/s
                          [Files: 1  Bytes: 2.41M [4.29MB/s] Redirects: 0  Todo: 0  Errors: 0              ]
e89822dd6c8eb196a1c5e73ebc063ba8bcbb7970ed8d7933d8431923adeca295  cargo-0.79.0.crate
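
The effect seen above can be reproduced without any network access. The following is a minimal Python sketch, with hypothetical function names and a made-up payload, showing why a client that transparently decodes a gzip-encoded response ends up hashing different bytes, and how a request can ask for the identity encoding instead. No request is actually sent here.

```python
import gzip
import hashlib
import urllib.request


def looks_gzipped(data: bytes) -> bool:
    """Check for the two-byte gzip magic number at the start of the data."""
    return data[:2] == b"\x1f\x8b"


def identity_request(url: str) -> urllib.request.Request:
    """Build a request asking the server not to apply a content encoding."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "identity"})


# Stand-in for a .crate file: gzip-compressed bytes as published upstream.
published = gzip.compress(b"stand-in for the tar stream inside a crate")
# What a client stores if it transparently decodes the response body.
decoded = gzip.decompress(published)

print(looks_gzipped(published), looks_gzipped(decoded))
print(hashlib.sha256(published).hexdigest() == hashlib.sha256(decoded).hexdigest())
```

For a hashing script, checking that a file expected to be a .crate or .tar.gz still starts with the gzip magic number is a cheap way to catch this class of false mismatch.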

Hi there,

FWIW, our team is working on designing RPM support for Konflux [1] [2], and as part of that we would like Konflux, or the part of it responsible for storing sources (basically a dist-git lookaside analogue), to be responsible for downloading sources before storing them.

In the ideal situation, a maintainer would reference the sources (URL, checksum, signature if possible) but wouldn’t upload directly to the lookaside cache. Instead, the system would do that on their behalf, ensuring that the checksum and signature match the maintainer’s expectations and that an audit trail for the source acquisition process exists (and also generating and storing provenance for the sources as part of the process).
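
As a rough illustration of that flow, here is a minimal Python sketch of a system-side acquisition step. It is not Konflux's design: every name (`acquire_source`, `Provenance`, `ChecksumMismatch`), the record fields, and the in-memory store are hypothetical, and signature verification is left out.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Provenance:
    """Hypothetical audit record; the field names are illustrative only."""
    url: str
    sha256: str
    size: int


class ChecksumMismatch(Exception):
    """Raised when fetched bytes do not match the maintainer's checksum."""


def acquire_source(
    url: str,
    expected_sha256: str,
    fetch: Callable[[str], bytes],
    store: dict[str, bytes],
) -> Provenance:
    """Fetch a source on the maintainer's behalf; store it only if it verifies."""
    data = fetch(url)
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        # Refuse to store: a changed upstream file is reported, never adopted.
        raise ChecksumMismatch(f"{url}: expected {expected_sha256}, got {actual}")
    store[actual] = data  # keyed by digest (assumed storage layout)
    return Provenance(url=url, sha256=actual, size=len(data))


# Usage with a fake fetcher, so that no network access is needed.
store: dict[str, bytes] = {}
tarball = b"release tarball"
digest = hashlib.sha256(tarball).hexdigest()
record = acquire_source("https://example.org/foo-2.0.tar.gz", digest, lambda u: tarball, store)
print(record)
```

Because the maintainer's checksum is pinned before the fetch, a file that later changes at the same URL fails verification instead of silently replacing the stored copy.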

Side note: We are currently working on writing the idea down and we will share it with the Fedora community. We are also about to move our RPM build pipeline prototype to the Fedora community space to do our development there and make it available on the Fedora Konflux instance [2].

[1] https://konflux-ci.dev/
[2] Fedora Konflux Cluster · GitHub

Of course, this could happen in such a way that an automated system is affected by it, but a manual workflow is not.

  1. Foobar 2.0 comes out at http://www.foobar.com
  2. Packager updates package to foobar-2.0 using the legit tarball of the release
  3. Attacker takes over http://www.foobar.com and replaces tarball with a malicious one
  4. Automated system notices discrepancy

Now what? It’s not the current file in the lookaside cache that is malicious, it’s the remote file we are comparing it against. If an automated system goes ahead and gets the “updated” source, we lost.

I think there might be a number of corner cases to this, but overall, I
think it’s a reasonable way to go.