Guardrails for too-short RC-to-Go/No-Go

Previously, in the Fedora Linux 34 Final Go/No-Go meeting:

#action bcotton to start a discussion on how to avoid the Extreme Time Crunch Burrito for F35+

What happened is that we had about five hours from the time RC 2 finished until the start of the Go/No-Go meeting. While there weren’t a ton of changes between RC1 and RC2, this is a suboptimal situation. My assignment is to start the conversation so we can avoid this in the future. We did this to ourselves (and I’ll admit that I am a big part of the “we” in this particular case), in part because we have no rules to stop us. While this doesn’t happen every time, it happens enough that we should come up with a way to protect us from ourselves.

As a reminder, we have two competing goals to balance here: releasing on time and releasing according to our criteria. The latter requires effort from our QA community, and we don’t want to put an unreasonable burden on them.

So how do we put the appropriate guardrails in place?

One idea is that if we don’t have an RC request 48 hours before the Go/No-Go, then we consider it an automatic no-go. I have a vague recollection of suggesting that a few releases ago and getting told “nah, it’s okay.” Perhaps 24 hours would be better?

My main concern with this approach is that not all RCs are equal. If we spin a new RC to fix one relatively isolated blocker, it can be fully tested in minutes. The test results from previous release candidates can carry forward to this one without concern. Of course, this becomes a judgment call and how do we decide where to draw the line? It leaves us open to the same issue.

Another idea is to not allow freeze exceptions within a certain window before the go/no-go. This reduces the number of changes between RCs and so limits the amount of re-testing that needs to be done.

We could also take a hybrid approach where we say, for example, the RC must happen 24 hours before the meeting, but “minimally different” (however we define that) RCs can happen up to 12 hours before the meeting, with no freeze exceptions allowed for those.

Or maybe there’s an entirely different approach?

I can’t speak for others, but I feel a pressure/desire to ship “on time” that’s kind of hard to ignore.

I think adding the 2 dates has really helped us, since then you don’t feel as bad missing the ‘aspirational’ date. I wonder: would adding another one help or just add confusion? i.e., have an aspirational date, an ‘early’ date and then a ‘planned’ date? (i.e., just tack a new week onto the end of the schedule). I think that might help us, but confuse users/the press. :slight_smile:

Another thing that may help is to stop fudging the go/no-go meeting. i.e., it’s always Thursday; if we are no-go then, then we are. No holding open, or moving to Friday, etc.

Limiting by RC is hard, as you note. It could be that a new RC fixes just one blocker and all the rest of the testing can carry forward. Or even, we want another RC in case we decide something is a blocker at the go/no-go and want to ship that RC (which we have done in the past). I think we could safely say “no RC by Wednesday, no go” though, as that avoids most of those problems.

Looking forward to hearing others’ ideas. :slight_smile:

I think three would be too much. In order to reduce confusion, I added (at mattdm’s suggestion) explanatory labels to the release days saying “Fedora {contributors,users} plan on this” for F35+. Adding a third date feels a little bit like “if we just keep putting dates on the schedule, we’re never actually late.” In particular, because it would fall in the next month, it would seem like a bigger slip to the public.

As much as I rah-rah shipping on time, it’s not the end of the world if we’re a week late. If we have a policy that prevents us from releasing on time because there are too many blockers, then we should just try to not have bugs next time. :slight_smile:

I think we’ll always do what we can to try to get the release out on time, which is why we need an actual policy here. Otherwise, no matter how many times we say “let’s not do this again”, we’ll end up doing it again.

There could be dual RC rules:

  • There must be an RC 48 hours before G/NG.
  • An RC must have 12 hours of testing prior to G/NG to be considered for release.
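
The two rules above can be sketched as a small check. This is only an illustration under stated assumptions: the function name `eligible_rcs`, the constant names, and the example dates and meeting time are hypothetical and not part of any Fedora tooling. The 48-hour and 12-hour values are the ones proposed in this post.

```python
from datetime import datetime, timedelta, timezone

# Thresholds proposed in the thread; the constant names are hypothetical.
FIRST_RC_LEAD = timedelta(hours=48)
MIN_TEST_TIME = timedelta(hours=12)


def eligible_rcs(rc_done, meeting):
    """Return the names of RCs that may be considered at the Go/No-Go.

    rc_done maps an RC name to the time its compose finished.
    An empty list means an automatic no-go under the two rules.
    """
    # Rule 1: some RC (not necessarily the one that ships) must exist
    # at least 48 hours before the meeting.
    if not any(meeting - done >= FIRST_RC_LEAD for done in rc_done.values()):
        return []
    # Rule 2: only RCs with at least 12 hours of testing time qualify.
    return sorted(name for name, done in rc_done.items()
                  if meeting - done >= MIN_TEST_TIME)


# Illustrative data only: a Thursday meeting at 1600 UTC is an assumption.
meeting = datetime(2021, 10, 21, 16, 0, tzinfo=timezone.utc)
rcs = {
    "RC1": datetime(2021, 10, 18, 12, 0, tzinfo=timezone.utc),  # 76 hours before
    "RC2": datetime(2021, 10, 21, 3, 0, tzinfo=timezone.utc),   # 13 hours before
    "RC3": datetime(2021, 10, 21, 11, 0, tzinfo=timezone.utc),  # 5 hours before
}
print(eligible_rcs(rcs, meeting))
```

In this sketch RC1 satisfies the 48-hour rule, RC2 has had enough testing time to be considered, and RC3 is too fresh. Passing the check only means a "go" is permitted; the meeting can still decide an eligible RC needs more testing.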

Just to make sure I understand, you’re suggesting “there must be any RC 48 hours ahead of time, but not necessarily the RC we pass judgment on”? So if RC1 is done on Monday, we could have RC2, 3, etc so long as the last one is done by 0400 UTC Thursday?

If so, I think that’s a reasonable approach. RCs done a day apart are likely to be pretty similar in most cases, and we could still say “RC15 is too different, so let’s pass until we have more time to test”.

Yes and yes.

I see a need to balance the “human” part of the process with concrete, clearly-defined boundaries for what happens during a release. Others have weighed in on the concreteness of what the guardrails might look like in terms of RCs, Go/No-Go meetings, and the like, so I will not add to what was already said.

My suggestion is to create a human-review process that considers the unique context of any given release. If there are bugs that require more testing than a quick fix whose test results can be carried over, we need a human being (or group of human beings) who understands the Fedora Project, the release schedule, and our community. Guardrails are usually made of steel or some other heavy metal, so they are not very good at adapting or changing in real time. :grinning_face_with_smiling_eyes:

I am wondering who has the power and authority to say “we will not ship Fedora 42 because of BZ XXXYYY.” Whoever normally gets to make that call should be a part of this conversation. Maybe they already are and I am oblivious, but then that likely means a documentation “easyfix” :wink:

Wonder no more! It’s a collective decision: QA:SOP blocker bug process - Fedora Project Wiki

They are (or at least, they all got the same email pointing them to this thread). And yes, it will be an easyfix once we know what we’re updating the documentation with. That’s the hard part.

This is a good scheme.

When Freeze Exceptions and Blockers are evaluated, is it possible to assign a minimum test time that each requires before a G/NG meeting can make a decision?

I like this approach. The RCs are usually not that different, which means the minimum verification time should be larger for the first RC (because that testing also applies to the subsequent RCs, which share most of the code), and the following RCs might have the minimum verification time reduced. Then it’s about setting the right values; 48 hours and 12 hours seem like decent options.

Of course these would be just the “minimum” rules. We could still say “the latest RC has been under testing for the last 12+ hours, but we’re not confident about it, let’s slip and give it more testing time”.

Seems like the main idea is to shift the burden of deciding what is enough RC testing away from a person and onto a policy, because no one wants to be the person who causes a slip, even though a one-week slip is not a big deal. We might only need one rule to achieve that. If we’re dithering over which RC to accept, we’re in a pretty good place :slight_smile: even if it has only had a few hours of testing. And we have had cases where we have truly isolated low-risk FE fixes that just land late, even day of G/NG.

So here is another idea:

We have the “last minute blocker bugs” policy of 5 days before G/NG. We could make one cutoff time for both the “last minute blocker” and “first RC must appear by”. Five days out from G/NG might be too long for first RC, but maybe it’s a bit too long for last minute blockers too? They are somewhat related because we can’t have an RC until all blockers are resolved. Two days might be a bit too tight, because even this time around we were nervous about having more than just a couple days to test shim (which admittedly is an exceptional case). But maybe 3 or 4 days?

To be clear, I’m suggesting the same cutoff time for both 1st RC and “late” blocker bugs. And not having a secondary minimum test time for subsequent RCs. It’s a bit simpler because it’s one less rule, and also merges with an existing one.

From a RelEng point of view, the Go/No-Go should happen on Thursday. That avoids RelEng working on weekends to get the content synced to mirrors, rather than working on it on Monday and hoping the mirrors get the latest content. This becomes much more important for 0-day blockers.

As a community member, I think a compose needs to have at least 48 hours of testing for RC 1.1, and the following RCs can have ~24 hours of testing (approximate because I don’t want to test a single blocker for 24 hours and delay the release).

Exactly! Much better to blame the rules! :slight_smile:

That’s not entirely true. We sometimes have an RC request knowing there are unresolved blockers because we may choose to waive them. I don’t know that 5 days is the right number for “last minute blockers”, but that’s also because I don’t know if there is a right number. It’s basically a question of how we want to balance two competing interests.

Right. We can always say we’re “no-go” no matter how long we’ve had the RC, so I don’t think we need to be concerned with exceptional cases here.

I’m not enthusiastic about this idea. It is simpler, but I think it may be too simple. And it wouldn’t have prevented the issue that sparked this discussion, because that was RC 1.2. I’m okay with the idea of shortening the “last minute blocker” window to match the “must have the first RC” window. But I do think it’s worth having a separate minimum age for subsequent RCs.

24 hours seems like a lot because it represents the potential of a week delay for the “simple to fix, simple to test” blockers. If that’s the consensus, I won’t argue against it. If I recall correctly, it’s pretty rare for us to spin a new RC with just a single, low-impact update and no FEs. So maybe it doesn’t really matter. But in those cases, I think a “Mohan or Kevin kicks off an RC compose as they leave work for the day and it’s ready for the folks in Europe to start batting around when they sit down in the morning” is a reasonable window for most 1.not-one RCs.

One side effect of the various proposals being discussed here is that we may have more “throwaway” RCs that we know won’t quite clear the bar but are close enough to start filling out test matrices. What sort of impact does that have on RelEng’s workload?

Well, IMHO, the nightly composes are for this, no? We wouldn’t have an RC requested with known blockers, would we? I know we used to have TCs (test composes), but the nightly composes really took over for that, IMHO.

We would if we expect that we might choose to waive a known blocker. For example if there’s a blocker that seems like it’ll take a month to fix but it’s not particularly harmful, we might decide to waive it under the “it’s too hard!” exception. But we can’t decide to do that (or, more correctly, we want to make that decision as late as possible) if there’s not an RC to decide on.

My understanding is that the test matrices don’t really count unless it’s an RC, but that might be something QA could modify. @adamwill, I’m sure, has a lot to say on this.

I see the value of testing on a nightly compose and a candidate compose that isn’t really expected to be releasable as the same. We used to do more candidate composes, because we were less efficient at pushing fixes stable. Now that we tend to push fixes stable quite efficiently, I just don’t bother requesting as many candidate composes, because it is not necessary.

So for the purposes of this discussion, would you start filling in the test matrices for the eventual go/no-go decision based on nightlies? I’m thinking of cases like where a blocker only affects, say, KDE Plasma, so you could basically get the Server and Workstation and Cloud and … testing done and on the books.

We already transfer results from previous builds where necessary and judged appropriate/safe. Including from nightly builds.

Okay, that was the missing piece for me. I knew you carried RC results forward conditionally, but I didn’t know you did for nightlies, too.