Proposal: a SIG to improve production stability and incident management

Fedora aims in its mission statement to be First:

We are not content to let others do all the heavy lifting on our behalf; we provide the latest in stable and robust, useful, and powerful free software in our Fedora distribution.

At any point in time, the latest Fedora platform shows the future direction of the operating system as it is experienced by everyone from the home desktop user to the enterprise business customer. Our rapid release cycle is a major enabling factor in our ability to innovate.

We prioritise a rapid release cycle; we also prioritise a stable and robust platform which a wide spectrum of users can use as a daily-driver desktop or production server.

This proposal aims to create a SIG which will develop a recommended stream of work to improve our stability and robustness, and to improve our communication with users when problems arise that compromise that stability and robustness.

In posting this I’m looking for:

  • Feedback on the proposal below.
  • Contributors interested in joining a SIG to improve production stability and incident management.

Also, I’m not very familiar with the ins and outs of the Fedora org, so suggestions other than a dedicated SIG are welcome if those would work better.


What is production stability?

A proposed working definition:

We’ve achieved production stability when:

  • a user of a Fedora product[1] …
  • using the stable repos of a supported version (i.e. 42 and 43 at the time of writing)…
  • performing regular system updates…
  • and upgrading to the next major version when it is officially released…
  • does not experience significant breakage to their system due to updates and upgrades.

Where are we currently?

  • Users have been relatively hard hit by problematic updates recently. (Examples: 1 2 3 4 5 6 7)

  • Communication about production incidents is sometimes good, but inconsistent. For example, when there is an incident there may be a “Common Issues” post on Discourse - and the format of these helpful user-oriented posts could serve as a good template - but not all incidents get these posts.

  • It’s not clear that the lessons learned from incidents are systematically reviewed and incorporated into future practice.

Proposal

To establish a Production Stability SIG focused on these goals:

  1. Reducing the rate of breakage-causing updates pushed into the stable repos.

  2. Improving communication with users as to the status of significant incidents – when is a fix expected?; how can users downgrade?; how can they best work around the issue pending a fix?

  3. Understanding the lessons learned from incidents and how they inform the testing and release process.

    • which cycles back to: 1. Reducing the rate of breakage-causing updates pushed into the stable repos.

Proposed initial scope of work for the SIG

  • Define in detail:

    • What constitutes an incident and how do we identify it? (Intuitively, the definition is something like “non-trivial breakage to a non-trivial proportion of users”, but we will want to flesh that out.)

    • The process for tracking incidents (part of it is a Bugzilla ticket; but compared to a standard ticket, an incident involves the extra components of user communication and post-mortem followup).

    • The process for reporting status to users during an incident (how do we gather the necessary information?; what channels do we communicate status to?; how do we follow up on questions?)

    • The process for post-incident followups.

    • What organisational structure do we need to make this work?

    • Interfaces between the Production Stability SIG and other teams and SIGs.

  • Document all the above for review and acceptance by FESCo (is that the appropriate audience?)


Thank you for reading!


  1. We should define what this specifically means: Editions, Atomic Desktops, maybe spins too? ↩︎

6 Likes

On one hand, I would like to see more attention from the community put into Fedora’s quality and stability workflows.

On the other, we are at risk of creating a group whose single purpose is judging the work done by other people. This can get out of hand rather fast, and we would have to think about special measures to prevent it.

Thus, before we get deeper into the discussion, I’d like to ask a couple of questions first:

  • How is the goal of the proposed SIG different from the goals of Fedora Quality?
  • Have you tried to suggest any of these initiatives to Fedora Quality before? Were they rejected or declared out of scope?
4 Likes

My first thought when reading the proposal is that we already have a QA team and people working on both avoiding issues and communicating when they happen. Bugs slip through because we don’t have enough people testing and fixing things… If you have time to help with things, maybe it’s better to lend a hand to the existing organizations? The proposed SIG would overlap in responsibilities with the existing teams and workflows.

3 Likes

I think (and open to correction since I’m quite new to Fedora) that the Quality team is more oriented around testing and pre-release practice - which is an important part of production stability - and less so on managing incidents in production.

Clearly there’s an overlap between these things, because what we learn from production incidents should feed back into quality assurance practice.

Simple answer - no, but if putting this work within the Quality function makes better organisational sense, I’d be happy to go along with that.

3 Likes

I’d be happy to do that, but I would like to progress our incident management practice.

I’ve tried to proactively help by doing that (for example, on this release issue from June, I took the lead in communicating with users both here and on the KDE forum), but it’s outside any formal organisational structure, and it would be good if there were a better-defined and established way of doing that.

1 Like

Fedora Quality’s coverage is across all phases of a Fedora release’s lifecycle. They are the most visible during the development phases, but they definitely engage and support post-GA activities too.

Thanks, I think I see your point now.

And I think I understand where you’re coming from. But there are a couple of comments here.

The first is that we do indeed have a Common Issues process, which includes important issues found post-release.

You rightly point out that not all interesting issues get an entry there, but this is exactly where contribution is welcome - and it can happen without a new org structure.
(To be honest, I am not sure who currently officially owns the Common Issues process, but we can find out :slight_smile: )

It would also be nice to add some more visibility to this process - for example, posting a “Top 5 new issues in Fedora in the last month” article in Fedora Magazine, and so on. That would be a great initiative which could be done, for example, under the Fedora Marketing umbrella (cc @joseph ) or just directly.

The second part is trickier.

Let’s say a regression lands in a component in Fedora.

We noticed it, described it as a Common Issue, filed a bug for the maintainer, and explained it to the end user. Now you would like to have a retrospective, a lesson learned, and a way to prevent the issue from happening again. The question is: do you want to contribute the work needed to prevent it? Or do you want to demand it from someone else (probably the maintainer)?

If you want to contribute - write a test, participate in testing that component, and cast karma votes - these are all activities within the realm of the Fedora Quality team. Bodhi karma is one of the most accessible ways to participate in additional gating of Fedora packages. Writing automated integration tests is the second best. Co-maintaining a tricky component is also a good way to participate. And basically every existing working group in Fedora is working in this direction one way or another.

But if you want to be responsible for writing lists of requirements to pass to someone else - that’s going to be rather problematic. We are a volunteer-driven project. There is very little we can really demand from the members of our community. And we have to be careful not to promise quality SLAs or “enterprise-level incident management” on behalf of someone else doing unpaid volunteer work.

1 Like

The Mesa incident has been reviewed by Fedora QA (2025-11-24 Fedora Quality Meeting - #2 by kparal) with what I consider to be a fair resolution.

However, I believe that the corresponding discussion, “Is most recent Mesa push to stable (mesa-25.2.7-2.fc43) reasonable?”, despite a few attempts to dilute and shut down constructive conversations, has indicated that there may be deeper issues buried within collaboration processes, involving Red Hat development practices and, as @bookwar pointed out, unpaid volunteers no one has any right to demand anything from.

Is it within the scope of the Fedora QA team to develop proposals to improve said processes, including the perspective of unpaid volunteers, or does something extra have to be introduced, like the proposed SIG, considering that few people have insight into how Red Hat operates and why?

I wasn’t aware (David Airlie) had added an untested merge request to the stable mesa update.

It can indeed happen. And I’ve contributed myself, both by writing up Common Issues and by communicating to users here about issues outside of the formal Common Issues process. But contributing without an org structure has its drawbacks, in that people can get stuck doing it solo without an organisational mechanism to step in and help them.

(The issue I mentioned from June is one that I spent a lot of my week handling the communications on by myself. I was pretty tired and a bit burned out by the end of that.)

So a more formalised structure (whether it’s part of the Quality team or something else) for doing that should help to encourage contribution there.

I’d say there’s a shared responsibility here, and we need to work out how to make this happen collaboratively rather than adversarially. It should be neither just making demands of maintainers, nor (let’s say) demanding that “Quality should have tested it better”.

This is part of why the SIG proposal envisages the SIG working to define processes, rather than just presupposing in the proposal what the process should be.

But after all, we should have aligned goals - we all want a stable and robust system and want to contribute in a way that achieves it.

Sadly I’ve broken production systems plenty of times in my life. I never wanted to break production, and when I did I wanted to avoid repeating the same problem - part of which was getting feedback about what went wrong and what could have been done better.

1 Like

Just maybe to clarify:

I am not exactly against the initiative. I am participating in the discussion to voice some of the concerns which come to mind, but I haven’t formed an opinion yet.

I also think that there is a lot of value in the coordination work in FOSS projects - the work which falls in-between the established groups and is dedicated to connecting those groups to implement a certain cross-functional project or initiative.

I am just not sure that “a SIG of coordinators” is what we need for it. It might be better to form this group around one specific work item at a time.

Working on a follow-up to a “Steam doesn’t work on Fedora” issue and working on a follow-up to a “bootloader in a dual-boot setup” issue may attract different people. So it might be that you will not have enough folks interested in a more generic “follow up on any issue, no matter what it is” type of work.


It would be nice to hear from other members of the community too. Are there other folks willing to join?

Like Germans say “Treffen sich drei Deutsche, gründen sie einen Verein”. If you get three or more folks interested, the overhead of creating and maintaining a SIG might pay off :slight_smile:

2 Likes

Thanks, understood. And just one clarification of my own:

The conception of the SIG here is to work on defining a process (and then seeking review and approval for it). Part of doing that is figuring out what organisational structure (existing or new) would put that process into effect - I’m not presupposing that the SIG would necessarily be the ongoing operational structure.

1 Like

Thanks for the proposal!

I’m speaking here as a long-time member of the Quality team, and the team lead for the RH group on that team, but these are all my own thoughts; I haven’t run this by the rest of either team at this point.

My opinion, based on pretty long experience, is that I don’t think we need a heavy team/process setup here. A lot of Fedora SIGs and WGs go idle or wind up being one-person operations, and I think this one would be at risk of the same. And I think it would just be unnecessary for the purpose.

We (Quality) absolutely do consider things like this within our area of interest. That’s why it came up at our meeting on Monday, and why I got involved in the other thread. I look at quite a few incidents - usually less noticeable than this one - and try to smooth them along, like updates failing gating tests or getting stuck in updates-testing due to negative feedback. In my experience an informal approach tends to work well, without requiring too much bureaucracy or getting too heated.

Most packagers are easy to get in touch with in many ways - here, or on Matrix, or by email. They’re usually happy to work through problems and talk about ways to avoid them in the future.

In this particular case, the packagers involved have already been active in these discussions, and @airlied already raised the most obvious point moving forward - he asked Leigh to avoid pushing updates that have negative karma to stable in future. That’s the same thing I would have suggested. I’d also suggest it would have been useful to have more detail on what the backport that caused the problem was intended to achieve: in the end this was kinda moot, but if the issue had dragged on any longer, I or another proven packager might have wanted to consider reverting the change, and it would have been easier to make that decision if we’d known exactly what the change was meant to do.

I suspect if we put together a SIG and a process and ran this issue through it, the outcome would be much the same - a recommendation to ask packagers to be very careful manually pushing updates created by other packagers without prior communication, especially if the update has any negative karma.

So my personal opinion is that I’m not sure such a SIG would remain functional in the long term, and if it did, whether it would achieve much we don’t already manage by more informal means. But if other folks think it would be a good idea, I’m not going to stand in the way.

5 Likes

To provide a bit of a user and outside perspective from a distro-hopper that more often than not just ends up back on Fedora, I largely agree with your sentiment. Fedora is and remains more stable than a lot of other distros and a SIG like this has never been needed to maintain that.

I personally think this particular incident getting so much attention speaks more to how popular and important Steam has become as a main driver of increased Linux market share and optics than it does to actual quality issues. As a Steam user myself, I question whether the issue would have garnered the same notoriety if it had affected something other than Steam. I feel as though Windows has broken gaming more often in the past year than Fedora has, but it’s more normalized on that OS.

1 Like

BTW, looking through the issues you cited, I see a few themes.

A lot of them relate to major KDE / Plasma / Qt backports to stable releases. This is a pretty onerous process because there are hundreds of packages involved in these builds, and it’s easy for there to be conflicts with other in-flight updates, missed packages, and stuff.

@ngompa correctly points out that the tooling we have here isn’t great. Doing an update of hundreds of packages with interdependencies involves a lot of manual effort and is easy to get wrong. Better tooling could help with that. This is a generally known issue, though, and a retrospective process for these cases wouldn’t really do much beyond flag up again that we could use better tooling.

Another thing that I’ve talked about with the KDE and GNOME folks before is that the default karma autopush threshold of +3 is not really appropriate for these megaupdates, because there are folks who are very keen to file +1s and tend to do so very quickly, so these big updates that could use more careful testing can get pushed to stable within hours of submission, before there’s really been a chance to kick the tires properly. I’d suggest (again) that the teams should change the autopush threshold to at least +5 for these megaupdates.

I have a more controversial idea, there, too: I think it’d be interesting to force a delay on autokarma pushes. Say, we don’t push an update stable for reaching the autokarma threshold until it’s been in updates-testing at least a day. This is somewhat awkward to implement, though, because karma autopush is currently implemented as an immediate response to karma appearing; adding a delay would require a substantial redesign. And, as I mentioned on the other thread, there’s the contrary pressure when we want to get an update out as fast as possible, to fix a CVE that’s in the public eye, or correct a previous bad update; this idea would mean we’d have to do a manual push in such cases.
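The gating condition itself is simple to state, even if wiring it into Bodhi’s event-driven autopush is not. A minimal sketch of the idea (the function and parameter names here are illustrative, not Bodhi’s actual API):

```python
from datetime import datetime, timedelta

# Illustrative sketch of the proposed rule: reaching the karma threshold
# is necessary but not sufficient; the update must also have soaked in
# updates-testing for a minimum period (one day, in this proposal).
MIN_SOAK = timedelta(days=1)

def should_autopush(karma: int, threshold: int,
                    submitted: datetime, now: datetime,
                    soak: timedelta = MIN_SOAK) -> bool:
    """Return True if the update may be auto-pushed to stable."""
    return karma >= threshold and (now - submitted) >= soak

# A +3 update that hits the threshold four hours after submission would
# now wait out the soak period instead of going straight to stable.
t0 = datetime(2025, 11, 24, 9, 0)
print(should_autopush(3, 3, t0, t0 + timedelta(hours=4)))          # False
print(should_autopush(3, 3, t0, t0 + timedelta(days=1, hours=1)))  # True
```

The awkward part, as noted above, is that today’s implementation reacts to karma events as they arrive; a soak period means re-evaluating the condition on a timer rather than only when karma changes, and urgent CVE or revert pushes would need the manual escape hatch.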

6 Likes

I believe we would need to improve how users become aware of issues and how they can follow them, along with responses and possible solutions. Arch Linux has Arch News, a place where they notify users about potential problems with the distro and provide tips for solutions. We could implement something similar for Fedora. Currently, the infrastructure is excellent, but only for those who are already familiar with it — for the end user, it is complex. So having a fixed place where we can publish these relevant problems, possible solutions, and notices would be very appropriate.

4 Likes

I saw that, and thanks for your work on it!

I think this Mesa incident is an atypical one, and I don’t want to overemphasise that specific case as a motivation here. It’s true that it’s the immediate stimulus that prompted me into writing up a concrete proposal, but these are ideas I’ve been mulling over for a few months (probably since the qtwayland incident).

How do you feel about the “Improving communication with users” leg of the proposal?

We’ve talked about the considerations of managing a volunteer-driven project. User support (on Discourse and elsewhere) is largely done by volunteers too - and at least from my perspective of doing it, we could reduce the strain on support volunteers by having a formalised structure of how we organise and share the work during an incident.

If not in a SIG, is the Quality team a venue in which we could work on that part of the proposal?

Thank you very much for the insight provided in both discussions.

Is it reasonable to compare “updates-testing” to “staging” in generic software development?

Is it expected to have any form of quality control (done by dedicated team or developers themselves) before the new build is submitted into “updates-testing” for a stable Fedora release?

The big thing with the Mesa issue is that it was very hard to pinpoint from the end-user’s perspective. There were no errors or warnings in any log files. Things just stopped, without any movement or indication of issues.

However, as Jean-Baptiste LEPESME pointed out in the upstream discussion related to the merge request, the issue was obvious when running the “vkcube --validate” command. But coming up with this requires very specific domain knowledge.

Is it reasonable to have some form of sanity/smoke tests be mandatory before a build is submitted into “updates-testing” (as I already asked @airlied, without any response so far)?

This, together with the improvements to build descriptions you mentioned, may contribute significantly to maintainers and/or proven packagers making the correct push/no-push decision.

Is the above something the Fedora QA team has considered?

Is this an actionable plan to move forward?

If there were enthusiastic people who want to prevent similar issues happening in the future, we (Quality) would surely welcome them. I don’t think they need to form a separate SIG; just being part of Quality makes sense to me. In this particular example, those people could look into whether vkcube --validate can be run in a headless environment or not, and whether we can create a Fedora CI test that runs on every new package build/update to test it.
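As a sketch of what such a CI check could look like (assuming vkcube --validate can in fact run headlessly, which is exactly the open question above; the wrapper below is generic and the names are my own, not an existing Fedora CI API):

```python
import subprocess
import sys

# Generic smoke-check wrapper: run a command and treat a non-zero exit
# code, a missing binary, or a hang as failure. The timeout matters here
# because the Mesa regression produced no errors at all - things just
# stopped - so a hang is the only symptom a test could catch.
def smoke_check(cmd: list[str], timeout_s: int = 30) -> bool:
    """Return True if the command exits 0 within the timeout."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0

# In CI this would be something like smoke_check(["vkcube", "--validate"]);
# a stand-in command is used here for illustration.
print(smoke_check([sys.executable, "-c", "pass"]))
```

Whether this can run on every mesa build depends on getting a working (possibly software-rendered) Vulkan environment in the CI runners, which is the part that would need investigating.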

On a higher level, overhauling how karma autopush works in Bodhi would also be a great initiative, coming up with actual practical suggestions (with past examples showcasing existing problems) for the new workflow. This would require non-trivial change to Bodhi code, though.

What we could do even faster is to have some kind of tracking of negative karma in Bodhi. Perhaps a bot posting messages to our #quality Matrix channel, or a generated webpage/special view in Bodhi, etc. If we were immediately notified of potential problems, and a lot of people followed those messages, we could focus on those problematic updates very quickly. In the broken Mesa update, if we had an additional 3+ people who also tested it with Steam and posted -1 “broken here as well” (one of them ideally filing a bug report with logs), it would probably have convinced the Mesa maintainers that this was not a random accident affecting just one or two people. Feedback is about volume: the more +1/-1s we have, the easier it is to judge whether to stop an update or not. So if we had this dedicated crowd of testers who immediately jump on updates with warning signs, we could do much better than now.
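The notification side of this could start very small. A sketch of the filtering logic such a bot might use (the event shape and names are hypothetical, not Bodhi’s real message schema):

```python
from collections import defaultdict

# Hypothetical sketch: watch a stream of karma events and flag each
# update the first time its negative feedback crosses a small threshold,
# so testers can converge on it quickly.
class KarmaWatcher:
    def __init__(self, threshold: int = 1):
        self.threshold = threshold
        self.negatives = defaultdict(int)  # update id -> count of -1s
        self.alerted = set()               # updates already announced

    def on_karma(self, update_id: str, karma: int):
        """Return an alert string the first time an update crosses the
        negative-karma threshold, otherwise None."""
        if karma < 0:
            self.negatives[update_id] += 1
        if (self.negatives[update_id] >= self.threshold
                and update_id not in self.alerted):
            self.alerted.add(update_id)
            return f"update {update_id} has {self.negatives[update_id]} negative karma"
        return None

w = KarmaWatcher()
print(w.on_karma("mesa-25.2.7-2.fc43", +1))  # None
print(w.on_karma("mesa-25.2.7-2.fc43", -1))  # alert on the first -1
print(w.on_karma("mesa-25.2.7-2.fc43", -1))  # None (already alerted)
```

A real bot would feed this from the message bus and post the alert to the #quality Matrix channel; the point of the sketch is just that “alert once, on the first warning sign” is enough to get eyes on an update early.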

Additionally, notice that mesa updates don’t have any test cases linked in Bodhi. It is very easy to write a simple test case that tells people: “In order to test this, try at least some of the following: 1. run glxgears; 2. run vkcube (--validate); 3. run some games, including Steam games if you can.” But this requires contributors; we can’t go and write a test case for every package in Fedora (although we should focus on the critical packages and do it at least for those). There are a lot of tasks we’d love to do, we just don’t have the manpower to do them. But we’re very happy to make sure that we have good documentation for these workflows, so that people can participate. (Raising awareness of these participation options is another thing we could improve.)

So overall I think there are ample opportunities, and if there are passionate contributors, please tell us and we’ll try to point you in the right direction. If our documentation or processes suck, please also tell us and help us improve them, so that we can gather more contributors in the future :slight_smile:

5 Likes

I am sure that “vkcube --validate” failing indicates that software other than Steam (including some industrial applications) may fail as well. Considering that some of that code, if undetected, may slip into RHEL, should paid non-volunteers do something about it, in addition to complaining that unpaid volunteers are not doing it?

This is what a developer is expected to do, at least on their own hardware, before making changes to a code base public, asking to merge into upstream, or publishing to “updates-testing”, isn’t it? At least that is my impression, based on what I and my colleagues usually do when developing software professionally.

Looking at the Mesa issue as an example, here is the build without the “backport”. Should the “dedicated crowd of testers” (I assume, based on the general sentiment, unpaid volunteers) “immediately jump” on it “with warning signs”, only to have their time spent testing invalidated by a sudden “backport”? I imagine the same concern applies to any other effort looking to synchronize their releases with Fedora.