Fedora aims in its mission statement to be First:
We are not content to let others do all the heavy lifting on our behalf; we provide the latest in stable and robust, useful, and powerful free software in our Fedora distribution.
At any point in time, the latest Fedora platform shows the future direction of the operating system as it is experienced by everyone from the home desktop user to the enterprise business customer. Our rapid release cycle is a major enabling factor in our ability to innovate.
We prioritise a rapid release cycle; we also prioritise a stable and robust platform which a wide spectrum of users can use as a daily-driver desktop or production server.
This proposal aims to create a SIG which will develop a recommended stream of work to improve our stability and robustness, and to improve our communication with users when problems arise that compromise that stability and robustness.
In posting this I’m looking for:
- Feedback on the proposal below.
- Contributors interested in joining a SIG to improve production stability and incident management.
Also, I’m not very familiar with the ins and outs of the Fedora org, so suggestions other than a dedicated SIG are welcome if those would work better.
What is production stability?
A proposed working definition:
We’ve achieved production stability when:
- a user of a Fedora product[1] …
- using the stable repos of a supported version (i.e. 42 and 43 at the time of writing)…
- performing regular system updates…
- and upgrading to the next major version when it is officially released…
- does not experience significant breakage to their system due to updates and upgrades.
Where are we currently?
- Users have been relatively hard hit by problematic updates recently. (Examples: 1 2 3 4 5 6 7)
- Communication about production incidents is sometimes good, but inconsistent. For example, when there is an incident there may be a “Common Issues” post on Discourse - and the format of these helpful user-oriented posts could serve as a good template - but not all incidents get these posts.
- It’s not clear that the lessons learned from incidents are systematically reviewed and incorporated into future practice.
Proposal
To establish a Production Stability SIG focused on these goals:
1. Reducing the rate of breakage-causing updates pushed into the stable repos.
2. Improving communication with users about the status of significant incidents: when is a fix expected, how can users downgrade, and how can they best work around the issue pending a fix?
3. Understanding the lessons learned from incidents and how they inform the testing and release process.
   - which cycles back to 1: reducing the rate of breakage-causing updates pushed into the stable repos.
Proposed initial scope of work for the SIG
- Define in detail:
  - What constitutes an incident and how do we identify it? (Intuitively, the definition is something like “non-trivial breakage to a non-trivial proportion of users”, but we will want to flesh that out.)
  - The process for tracking incidents (part of it is a Bugzilla ticket, but compared to a standard ticket, an incident involves the extra components of user communication and post-mortem follow-up).
  - The process for reporting status to users during an incident (how do we gather the necessary information, which channels do we use to communicate status, and how do we follow up on questions?).
  - The process for post-incident follow-ups.
  - What organisational structure do we need to make this work?
  - Interfaces between the Production Stability SIG and other teams and SIGs.
- Document all the above for review and acceptance by FESCo (is that the appropriate audience?)
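To make the tracking idea a bit more concrete, here is a minimal sketch of what an incident record might hold beyond a standard ticket: the user-facing status updates and the post-incident follow-up. Everything in it is an assumption for illustration only: the `Incident` and `Status` names, the fields, and the rule that an incident only counts as closed once lessons have been recorded are all hypothetical, and none of this reflects an existing Fedora tool or an agreed process.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    """Hypothetical incident lifecycle; the real states are for the SIG to define."""
    OPEN = "open"
    FIX_PENDING = "fix pending"
    RESOLVED = "resolved"
    REVIEWED = "reviewed"  # post-incident follow-up completed


@dataclass
class Incident:
    """Illustrative incident record: a ticket plus communication and follow-up."""
    summary: str
    bugzilla_id: Optional[int] = None  # assumed link to the tracking ticket
    affected_releases: List[int] = field(default_factory=list)
    status: Status = Status.OPEN
    workaround: Optional[str] = None  # e.g. downgrade instructions for users
    user_updates: List[str] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)

    def post_update(self, message: str) -> None:
        """Record a user-facing status update (fix ETA, workaround, and so on)."""
        self.user_updates.append(message)

    def is_closed(self) -> bool:
        """Closed only once the review is done and lessons were recorded."""
        return self.status is Status.REVIEWED and bool(self.lessons_learned)


# Example usage with made-up values.
incident = Incident(summary="Example breakage after update", affected_releases=[42, 43])
incident.post_update("Fix is being tested; a downgrade workaround is available.")
```

The point of the `is_closed` rule in this sketch is that shipping a fix does not end the incident: it stays open until the lessons learned have been written down, which is what feeds back into the testing and release process.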
Thank you for reading!
[1] We should define what this specifically means: Editions, Atomic Desktops, maybe spins too?