Package repo surgery due to checked-in source archive?

Does anyone remember, a good while ago — IIRC it was around Fedora 26, maybe? — we had an issue where a particular src.fedoraproject.org repo was causing problems with large, slow downloads?

We realized pretty quickly that a source archive had been accidentally committed to the repo, then deleted. But because it was still in the commit history, deleting it didn’t actually make any difference. So someone (possibly @churchyard, possibly @adamwill, possibly someone else entirely) had to go in and perform surgery on the commit history, rewriting both the commit that checked in the archive and every subsequent commit, so that the git blob containing the large archive was no longer part of the history.
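For anyone who hasn’t seen this kind of surgery, here is a minimal sketch of it on a throwaway repo. It is not what releng actually ran for the earlier cases (I don’t know what they used); it simply uses git filter-branch because that ships with git, although git’s own documentation now recommends the separate git-filter-repo tool. The helper name purge_path_from_history and the demo file names are made up for illustration.

```shell
#!/bin/sh
# Sketch: purge one path from every commit of a throwaway repository.
set -eu

# Hypothetical helper: rewrite all refs so that path $1 never existed.
purge_path_from_history() {
    # FILTER_BRANCH_SQUELCH_WARNING skips git's advisory pause about filter-branch.
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force --prune-empty \
        --index-filter "git rm --cached --ignore-unmatch --quiet -- '$1'" \
        -- --all >/dev/null 2>&1
}

# Build a small demo repo: import, accidental archive commit, then removal.
demo=$(mktemp -d)
cd "$demo"
git init --quiet .
git config user.name "Demo"
git config user.email "demo@example.invalid"

echo "Name: demo" > demo.spec
git add demo.spec && git commit --quiet -m "Initial import"

head -c 1048576 /dev/urandom > demo-1.0.tar.gz
git add demo-1.0.tar.gz && git commit --quiet -m "Oops: commit the source archive"
git rm --quiet demo-1.0.tar.gz && git commit --quiet -m "Remove the archive again"

# Count commits on HEAD that touch the archive, before and after the rewrite.
before=$(git rev-list --count HEAD -- demo-1.0.tar.gz)
purge_path_from_history demo-1.0.tar.gz
after=$(git rev-list --count HEAD -- demo-1.0.tar.gz)
echo "commits touching the archive: before=$before after=$after"
```

Even after the rewrite, the blob stays in the local object store until the backup refs under refs/original and the reflogs are dropped and git gc runs. On a shared server, every rewritten commit also gets a new hash, which is why this has to be coordinated rather than done casually.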

I was almost sure the package in question was Blender, or if not then possibly Inkscape. But I can’t seem to find any trace of… any of this. The problem report, the discussion that followed, the rewriting of the commit history…

I’m asking because tonight I discovered we have another such situation with the adobe-source-han-mono-fonts package repo [1], which will presumably require similar surgery to correct. So I wanted to link to the previous such mitigation as an example in the bugzilla report I’m currently opening. Except I can’t find it!

Notes

  1. The font package’s dist-git repo contains a 123MB font archive whose sha256 checksum is identical to that of the file spectool -g downloads from the Source0 URL. But it’s checked in to the repo itself instead of being uploaded to the lookaside cache, so cloning the repo requires downloading 214MB of data instead of the < 10MB it should be.
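The checksum comparison in that note can be sketched as below. The helper name same_sha256 is made up, and the file paths in the usage comment are placeholders, not the real archive names.

```shell
#!/bin/sh
# Hypothetical helper: succeed if two files have identical SHA-256 digests.
same_sha256() {
    a=$(sha256sum < "$1" | cut -d' ' -f1)
    b=$(sha256sum < "$2" | cut -d' ' -f1)
    [ "$a" = "$b" ]
}

# Usage sketch (paths are placeholders): compare the copy tracked in the
# checkout with a copy of the upstream file downloaded somewhere else.
#   same_sha256 ./checked-in-archive.zip /tmp/upstream/archive.zip && echo identical
```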

Background: I’m currently in the process of systematically cloning Every. Single. https://src.fedoraproject.org/rpms/* repo (you’re welcome, server admins). My script is moving alphabetically; currently it’s doing rust-*, and will be for quite some time. Of the ones I’ve gotten through so far, adobe-source-han-mono-fonts is the second-largest at 214MB. Behind it comes elfutils at 65MB, and they shrink real fast after that.

The first-largest is golang-modernc-ccorpus2 at 355MB, but that repo contains a generated source package created by downloading an upstream distribution with non-free components and removing those.

I’m not completely sure whether an archive like that is supposed to be checked in to the repo. If the answer turns out to be no, then there are two repos that need git-liposuction.

I was almost sure the package in question was Blender, or if not then possibly Inkscape. But I can’t seem to find any trace of… any of this. The problem report, the discussion that followed, the rewriting of the commit history…

Yes, blender, and before it mono.

see:

https://pagure.io/fesco/issue/1848
and
https://pagure.io/releng/issue/7265

I’m asking because tonight I discovered we have another such situation with the adobe-source-han-mono-fonts package repo [1], which will presumably require similar surgery to correct. So I wanted to link to the previous such mitigation as an example in the bugzilla report I’m currently opening. Except I can’t find it!

Note that this should probably be a releng ticket, but I guess file a bug on the package first to let the maintainer know and chime in.

You rock, thanks @kevin !

That was my thinking, bugzilla report first to give the maintainer a heads-up that they need to adjust their workflow, if nothing else. Then a releng ticket filed referencing that bug, and we set the bureaucratic gears in motion…

To that end, bug 2329655 filed.

Regarding golang-modernc-ccorpus2, the “Referencing Source” section of the Fedora Docs seems to imply that the modified source archive should be committed to the repo, so that one’s probably OK as-is. (Or the docs should be updated to be clearer about that.)

If anyone’s curious, BTW, cloning every repo requires 7.5GB of disk space, and results in 24202 directories.

The 10 largest, when cloned with a --depth=1 (git history omitted save for the current HEAD contents), are…

$ du -sch * |sort -h |tail -n 11
13M	aubit4gl/
14M	kernel/
14M	qmc2/
15M	file/
17M	rocblas/
21M	texlive/
26M	petsc/
65M	elfutils/
214M	adobe-source-han-mono-fonts/
355M	golang-modernc-ccorpus2/
7.5G	total

That 10-largest cutoff also happens to be an order-of-magnitude breakpoint for repo size. Only the 10 biggest repos exceed 10MB in size. The remaining 24192 repos all weigh in at < 10MB. (In fact there are fewer than 100 with a size that can even be measured in whole numbers of megabytes.)
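The "how many repos exceed 10MB" count can be reproduced with something like the sketch below. The helper name count_over_mb is made up, and it assumes input in the "SIZE, tab, NAME" shape that du -sm prints. Since du -sm rounds sizes up to whole megabytes, the counts can differ slightly from what du -h suggests.

```shell
#!/bin/sh
# Hypothetical helper: read "SIZE_MB<TAB>NAME" lines on stdin (as printed by
# `du -sm`) and print how many entries are strictly larger than the threshold.
count_over_mb() {
    awk -v limit="$1" '$1 + 0 > limit + 0 { n++ } END { print n + 0 }'
}

# Usage sketch, run from the directory holding all the clones:
#   du -sm */ | count_over_mb 10
```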

Releng ticket requesting cleanup of the repo is here: releng issue #12487, “adobe-source-han-mono-fonts: Request cleanup of upstream sources committed to package repo” (Pagure.io)

No, it needs to be uploaded to the lookaside cache like any other source archive.
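As far as I know, the usual way to do that is fedpkg new-sources, which uploads the file to the lookaside cache and records it in the sources file instead of in git. To spot a checkout where an archive ended up tracked in git anyway, a rough check could look like the sketch below. The helper name tracked_archives and the list of extensions are assumptions for illustration, not an official rule.

```shell
#!/bin/sh
# Hypothetical helper: list files tracked by git in the current checkout whose
# names look like source archives. The extension list is an assumption.
tracked_archives() {
    git ls-files | grep -E '\.(tar(\.(gz|bz2|xz|zst))?|tgz|zip)$' || true
}

# Usage sketch: run inside a dist-git checkout; empty output means nothing
# archive-like is tracked by git.
#   tracked_archives
```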

What part of the linked text implies it should be committed? I don’t see it.

(Sorry for splitting my reply in two.)

Well, primarily, the fact that it doesn’t say not to commit it, and that it’s listing only a local filename for the Source: entry — typically, local source file paths are ones that are part of the repo.

If the file shouldn’t be committed, then the proportionate messaging feels like it would be a big flashing notice box (metaphorically) to that effect, considering the level of badness in getting it wrong.

Honestly that applies to documentation of the lookaside cache in general, IMHO, as the hows & whys of using it are not laid out especially clearly in the packaging docs. Like I said, given the relative badness of screwing it up, the docs should have (a) a dedicated section explaining why we never commit package sources to the repos and how they should be managed instead, with appropriate levels of intensity and ReveredLiteratureOfChoice-thumping to drive the point home; and (b) links back to that section absolutely everywhere else the handling of upstream sources is discussed. (Like that section I linked to, and the one immediately previous…)

(Doesn’t help that neither of the two tutorials ever mentions it, either. The first one creates a .src.rpm package and scratch-builds it directly to Koji, instead of uploading sources, and the second one only covers local builds.)

Bugzilla bug about the golang-modernc-ccorpus2 repo: bug 2329783

And releng ticket: releng issue #12488, “golang-modernc-ccorpus2: Request cleanup of upstream sources committed to package repo” (Pagure.io)