Proposal: a SIG to improve production stability and incident management

I’m not sure it’s clear that they’re inherently more dangerous than other kinds of updates. After all, in @pg-tips’s list in the first post, this is the only one that’s like that. All the other bugs listed came from changes already merged upstream, or were downstream packaging issues.

I’d say the calculus about what’s a potentially ‘risky’ update would have to be a whole heck of a lot more complex, at which point the value of doing it becomes a bit murkier.

Honestly, I’d suggest we should do something much simpler and dumber. We should edit the generic “how to install” message on Bodhi update pages to include a warning that installing test updates is risky and you shouldn’t do it without a clear reason and a rollback plan. We could also make this clearer in the wiki page.

And (dons flame-retardant underwear) also we should get everyone on atomic installs already so you can roll stuff back as easy as pie. I’ve been running Silverblue on my main system for nearly a year, it’s mostly great…

I sent a couple of speculative proposals for tweaking the Bodhi autopush defaults, please take a look and see what you think:

Is it feasible to explicitly mention fast-track-no-upstream-yet (or something else) on the “Details” page in Bodhi for such builds, so people who may need this information to make push decisions don’t have to spend a lot of time digging through every new patch in all relevant builds?

Feasible? Sure. The description of an update is free text; the person creating it can write whatever they like there. Maintainers have wildly different practices for it. I write novels, like I do here. Some maintainers write nothing or “new version X”.

I do think the update description on the mesa update could have been better for sure.

The work to support snapshot-on-update and easy rollback should also help for non-atomic Fedoras.

I think it’s useful to track incidents separately from Common Issues because an incident is different from an issue. If we had a broken mesa update in a pre-release version of Fedora, it wouldn’t be an incident: it would just be a regular mesa update. The problem in this case was that it got into a stable version of Fedora, despite being flagged as broken before it was pushed. An incident review process, distinct from Common Issues, can help us figure out how this happened and what steps are required to prevent a recurrence. Without tracking and follow-up, we’ll just keep repeating the same incidents again and again.

While great for single-package updates, it doesn’t work at all for multi-package updates, where changes to multiple packages have to be batched together and tested at once. I’m not sure how we would change that, with pipelines tied to a particular merge request. And if developers cannot rely on it for our general workflows, then I’m afraid we’re probably much less likely to rely on it in the cases where we are able to do so…

Although of course we do want to encourage users to test our updates.

We do, but we want people who understand that that’s what they’re doing. Not people who are under the impression it’s just a handy way to get fixes early.

We could also add post-build checks at the Koji level that would mark a build as failed if they fail policy or suitability checks.

The openSUSE folks do this for their packages so that it’s not even possible to submit updates. And it’s completely independent of contribution flow (which is super-important since pull requests are used by a minority of packages, given how pointless they are when there’s only a single maintainer). For us, we could make it an extensible model that also invokes tmt or whatnot for extra stuff.

But we should be marking builds as failed if they fail tests, not waiting until it gets to Bodhi where it can be ignored.
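To make the “extensible model” idea a bit more concrete, here is a rough sketch of what a pluggable post-build check runner could look like. Everything in it is hypothetical: `Build`, `register`, `final_state`, and the two example checks are made-up names for illustration, and none of this is Koji’s actual API or plugin interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Build:
    """Hypothetical stand-in for a finished build; not a Koji object."""
    nvr: str
    files: List[str] = field(default_factory=list)
    compiled_ok: bool = True


Check = Callable[[Build], bool]
CHECKS: Dict[str, Check] = {}  # registry of post-build checks, in registration order


def register(name: str) -> Callable[[Check], Check]:
    """Add a check to the registry so new checks plug in without touching the runner."""
    def wrap(fn: Check) -> Check:
        CHECKS[name] = fn
        return fn
    return wrap


@register("has-payload")
def has_payload(build: Build) -> bool:
    # Illustrative policy (an assumption): a build that produced no files is unsuitable.
    return bool(build.files)


@register("has-binary-rpm")
def has_binary_rpm(build: Build) -> bool:
    # Illustrative policy (an assumption): at least one non-source RPM must exist.
    return any(not f.endswith(".src.rpm") for f in build.files)


def final_state(build: Build) -> Tuple[str, List[str]]:
    """Return (state, names of failed checks) for a build."""
    if not build.compiled_ok:
        # The build itself failed, so no post-build checks are run.
        return "FAILED", []
    failed = [name for name, check in CHECKS.items() if not check(build)]
    return ("FAILED" if failed else "COMPLETE"), failed
```

In this sketch a build that compiled fine but trips any registered check ends up `FAILED`, along with the list of checks that tripped it. A tmt run or any other extra test would just be one more function behind `@register`.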

I don’t like that idea. We’ve got 20 years of history that a failed build in Koji is a failed build, not a build that worked but doesn’t pass some sort of test. When I look through the Koji build history that’s the information I want to get.

The fact that it can be ignored at the Bodhi level is an intentional policy decision. If we wanted to, we could gate on all the tests and disable waivers; it would make the design simpler.

We don’t gate on all the tests because they’re not reliable enough, and we allow waiving because (especially after the first mildly disastrous attempt at implementing gating) maintainers would not accept a system where they couldn’t waive failures.
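For anyone who hasn’t looked at how this kind of decision works, the logic boils down to something like the sketch below. This is a simplified illustration, not the real gating code: the function name `gate`, its parameters, and the choice to treat a missing result as a failure are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple


def gate(results: Dict[str, bool], required: Set[str], waived: Set[str]) -> Tuple[bool, List[str]]:
    """Decide whether an update may be pushed.

    results  -- test name -> passed? (tests with no result count as failed; an assumption)
    required -- the subset of tests the policy gates on
    waived   -- failures a maintainer has explicitly waived
    """
    blocking = sorted(
        test for test in required
        if not results.get(test, False) and test not in waived
    )
    return (not blocking), blocking


# Hypothetical test names, for illustration only.
results = {"install": True, "upgrade": False, "flaky-desktop": False}
print(gate(results, required={"install", "upgrade"}, waived=set()))
print(gate(results, required={"install", "upgrade"}, waived={"upgrade"}))
```

The first call reports the update as blocked by `upgrade`; the second lets it through because that failure was waived. `flaky-desktop` fails in both cases but never blocks, because it isn’t in `required`. Gating on everything with no waivers would just mean `required` is every test and `waived` is always empty.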