Hello everyone,
I am investigating the verification of source code integrity after software is packaged and included in Fedora’s repositories. In the course of this research, I found a potential issue that is worth discussing further.
Summary
No automated checks ensure that the files referenced in remote `Source:` URLs are the same as the files actually used by Koji at build time. Since no such check occurs, a wrong version or a maliciously modified source code archive could be built and distributed in place of upstream’s source code archive. This is possible even if a packager is acting in good faith, and could result in distribution of malware via the official package repositories.
Analysis of all 35,665 remote files referenced in Fedora’s `.spec` files reveals that about 4.5% of remote files do not match what is in the lookaside cache and about 4.9% were unreachable. Looking specifically at files that did not match, were `Source0`, and were an archive (2.8% of all files), only 67.1% of files in this subset had no differences in archive content. This leaves 333 files that either had differences in content or whose source URLs did not return an archive.
Given this data, it is worth investing engineering time to try and improve the situation. For example, a pre- or post-build step could be introduced to verify the integrity of files used during the build.
Problem
The files that package maintainers upload to dist-git do not have to match the original file published by upstream. We do not seem to check hashes for package source code archives outside of the package review process. While there are edge cases where an upstream’s file must be modified or is unversioned, the majority of source files should match their upstream counterparts.
A packager could upload a vulnerable older version of source code or maliciously modified source code. This could cause Fedora to distribute malware. There are multiple possible ways this could happen:
- Packager accidentally uploads the wrong version from upstream
- Attacker takes over an abandoned upstream’s website and hosts backdoored source archives. Packager does not notice and packages using the compromised upstream URL.
- Packager maliciously modifies source code archive before uploading to dist-git
- Attacker could also compromise a packager’s FAS credentials. 2FA isn’t mandatory for packagers.
- Attacker performs a man-in-the-middle attack while a packager is working on a package using insecure protocols.
  - 15% of remote files use `http:` or `ftp:`
Data and Analysis
Comparing Source and Lookaside Cache Files
To explore this, I used spectool to extract the `SourceX:` URLs from every single `.spec` included in `rpm-specs-*.tar.xz` from here. Then, I hashed every remote file and compared it to its counterpart on the lookaside cache. The results are below:
Status Definitions
Status | Definition | Notes |
---|---|---|
Match | SHA256 hash of the source file matched the file on the lookaside cache. | |
Match, >1 version | SHA256 hash of the source file matched at least one version of the file on the lookaside cache. | This makes sense for unversioned files, but indicates a mistake if observed for a versioned file. |
Source unreachable | The source URL was unreachable. | May occur due to source URL being taken down, networking issues, typos in the spec file, or bugs in my script. |
Lookaside unreachable | Could not find a file on the lookaside cache. | May occur because packager has not uploaded a file yet, bugs in the hashing script, or the file is stored in git instead. |
Mismatch | SHA256 hash of the source file did not match any files found on the lookaside cache. | May occur due to source URL being replaced with another page/file, bugs in my script, hitting bot detection/firewall, or the file being unversioned and the source changing. |
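
The status definitions above can be sketched as a small classification function. This is only an illustration and not the code from my scripts: the function names, the order in which the unreachable cases are checked, and the reading of "Match, >1 version" as "a match where the lookaside cache holds more than one version" are all assumptions.

```python
import hashlib
from typing import Optional, Sequence


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def classify(source: Optional[bytes], lookaside: Sequence[bytes]) -> str:
    """Assign one of the status labels from the table.

    source:    bytes downloaded from the Source URL, or None if unreachable.
    lookaside: contents of every version of the file found on the
               lookaside cache (empty if nothing was found).
    """
    if source is None:
        return "Source unreachable"
    if not lookaside:
        return "Lookaside unreachable"

    source_hash = sha256_hex(source)
    cached_hashes = {sha256_hex(blob) for blob in lookaside}
    if source_hash not in cached_hashes:
        return "Mismatch"
    # Assumption: ">1 version" means several cached versions exist
    # and at least one of them matches the remote file.
    return "Match" if len(lookaside) == 1 else "Match, >1 version"


if __name__ == "__main__":
    print(classify(b"release-1.0", [b"release-1.0"]))
    print(classify(b"release-1.0", [b"release-0.9"]))
```

Keeping the fetching separate from the classification makes it easier to retry downloads that failed because of networking issues or bot detection without touching the comparison logic.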
Analyzing a Subset of Mismatches
Although 4.5% of remote files not matching seems bad, further analysis of files that had a hash mismatch, were `Source0`, and were an archive shows better numbers:
Diff result | File Count |
---|---|
No difference in content[1][2] | 679 |
File returned was HTML[3] | 101 |
File was archive with differences between files present in both archives[4] | 127 |
Other/Uncategorized | 105 |
Total Files | 1,012 |
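
To illustrate what "no difference in content" means here, the sketch below compares two tar archives by the SHA256 of each regular file they contain rather than by the hash of the archive itself. It is a minimal example under stated assumptions and not the comparison my scripts perform: it only handles tar-based archives, ignores permissions, symlinks and directory entries, and the helper names (`member_digests`, `same_content`, `make_tar_gz`) are made up for this post.

```python
import hashlib
import io
import tarfile


def member_digests(archive_bytes: bytes) -> dict[str, str]:
    """Map each regular file in a tar archive to the SHA256 of its content."""
    digests = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        for member in tar:
            if member.isfile():
                data = tar.extractfile(member).read()
                digests[member.name] = hashlib.sha256(data).hexdigest()
    return digests


def same_content(a: bytes, b: bytes) -> bool:
    """True if both archives hold the same file names with the same content."""
    return member_digests(a) == member_digests(b)


def make_tar_gz(files: dict[str, bytes], mtime: int) -> bytes:
    """Build a small .tar.gz in memory (only used for the demonstration)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


if __name__ == "__main__":
    upstream = make_tar_gz({"pkg/README": b"hello\n"}, mtime=1)
    repacked = make_tar_gz({"pkg/README": b"hello\n"}, mtime=2)
    modified = make_tar_gz({"pkg/README": b"hello\n\n"}, mtime=1)
    print(upstream == repacked)              # archive bytes differ
    print(same_content(upstream, repacked))  # file content is identical
    print(same_content(upstream, modified))  # an added newline is detected
```

A repacked archive with different timestamps fails a plain hash comparison but passes this one, which is the situation described in the first row of the table.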
Real-World Testing
I added a newline to a file in one of my packages before releasing an update. As you can see in Bodhi, all checks passed and the update is being distributed.
SHA512 (lookaside-numix-icon-theme-24.04.22.tar.gz) = c59440f5fb1d4c4c54d863d9815ae8da0976dab92e079b206025f372664995a38c841618173ffc38871f2c1766ae3bc32e143195fcd71c34d554a1f1f589fe70
SHA512 (source-numix-icon-theme-24.04.22.tar.gz) = d7bee3436c901e1f43b74a0e1312e4d40ce5ebc0b8a681d76c9fc3135c581b9a19252678edeaebece5dbb65afc011cf40dac52e24751f2ee273fd5338dfa8acf
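Checksum lines in the `ALGO (filename) = digest` form shown above are easy to check mechanically. The following is a rough sketch of how such a line could be parsed and compared against downloaded bytes; the regular expression and the function names are my own assumptions for illustration, not part of any existing Fedora tool.

```python
import hashlib
import re

# Matches lines such as: SHA512 (name.tar.gz) = <hex digest>
CHECKSUM_LINE = re.compile(
    r"^(?P<algo>[A-Z0-9]+) \((?P<name>.+)\) = (?P<digest>[0-9a-f]+)$"
)


def parse_checksum_line(line: str) -> tuple[str, str, str]:
    """Split a checksum line into (algorithm, file name, hex digest)."""
    match = CHECKSUM_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised checksum line: {line!r}")
    return match["algo"], match["name"], match["digest"]


def matches(line: str, data: bytes) -> bool:
    """True if the digest recorded in the line equals the digest of data."""
    algo, _name, digest = parse_checksum_line(line)
    return hashlib.new(algo.lower(), data).hexdigest() == digest


if __name__ == "__main__":
    payload = b"example archive bytes"
    line = f"SHA512 (example.tar.gz) = {hashlib.sha512(payload).hexdigest()}"
    print(matches(line, payload))
    print(matches(line, payload + b"\n"))
```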
Insecure Protocols
14% of files are fetched using `http:` and 0.9% use `ftp:`. As mentioned before, this could be problematic due to man-in-the-middle attacks.
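A tally like this can be reproduced from a list of source URLs with the standard library. The snippet below is a small sketch; the function names and the example.org URLs are placeholders, and treating only `http` and `ftp` as insecure is an assumption made for this example.

```python
from collections import Counter
from typing import Iterable
from urllib.parse import urlsplit

# Assumption: these are the only schemes counted as insecure.
INSECURE_SCHEMES = {"http", "ftp"}


def scheme_counts(urls: Iterable[str]) -> Counter:
    """Count how many URLs use each scheme."""
    return Counter(urlsplit(url).scheme.lower() for url in urls)


def insecure_share(urls: Iterable[str]) -> float:
    """Fraction of URLs fetched over http: or ftp:."""
    counts = scheme_counts(urls)
    total = sum(counts.values())
    insecure = sum(counts[scheme] for scheme in INSECURE_SCHEMES)
    return insecure / total if total else 0.0


if __name__ == "__main__":
    urls = [
        "https://example.org/a-1.0.tar.gz",
        "http://example.org/b-2.0.tar.gz",
        "ftp://example.org/c-3.0.tar.gz",
        "https://example.org/d-4.0.tar.gz",
    ]
    print(scheme_counts(urls))
    print(insecure_share(urls))
```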
Possible Solution
In the short term, a waivable check comparing the hashes of remote files against the files in a package’s `.src.rpm` could be created.
In the long term, the buildsystem should handle fetching sources for packagers. Outside of edge cases that prevent a file from being retrieved automatically, packagers shouldn’t need to interface with the lookaside cache. There’s a long-standing issue in RPM that I’ve linked below discussing this.
References
- Data gathering scripts and raw result data: GitHub - mymindstorm/fedora-lookaside-verification
- Graphs and merged data: Hex
- RPM discussion regarding adding support for hash verification in RPM: RFE: read sources checksums from the SPEC file and verify them · Issue #463 · rpm-software-management/rpm · GitHub
Corrections
Lastly, please call out any mistakes or bugs. I’m sure that my data isn’t perfect, but it should be a good starting point for a discussion.
[1] This likely means that the date or compression settings were different from upstream’s published archive, i.e. the packager repacked the archive.
[2] 514 of these mismatches with no difference in content were `rust-*` packages. I am not sure why.
[3] This presents an issue as a hostile party could take over a website and impersonate upstream or trick someone who is unfamiliar with a given package (e.g. proven packager) into downloading modified sources. If you look through the data you can find examples of upstream websites that have been taken over by gambling sites.
[4] Different version and/or source code was manually modified by packager. Includes false positives where packagers use `main` instead of the same ref that upstream releases the archive with.