Glibc version in RHEL 10, or: how to save the 🌍 from mojibake

Hey all

TL;DR: I’d like to plead the case for the recently adopted-into-C2y N3366 to make it into RHEL 10, even though this is perhaps a hail mary operation.

I’m opening this here because I don’t really know of a better place to start this discussion. But since RHEL essentially branches off from Fedora at some point, getting this into Fedora is a necessary precondition anyway – though I have no idea whether the decision about where to branch off RHEL 10 has already been made (and if so, I assume it cannot be communicated yet). If that ship has sailed already, too bad. The reason I’m opening this thread is the sliver of hope that it hasn’t, and perhaps the small chance that wider discussion of the subject may still influence things. :crossed_fingers:

For some background on myself, I help maintain a large cross-language & cross-platform ecosystem[1] called https://conda-forge.org, which sees >2 billion package downloads per month, the lion’s share of them on Linux. This is only tangentially related to the story below, except perhaps that it provides a bit of background when I appeal to the role of all of us as stewards[2] of the computing ecosystem in the wider sense.


Text processing in C is unfortunately a wasteland, and since C is effectively the kind of lingua franca that every other language needs to interface with, this leads to the extreme prevalence of encoding issues we see everywhere. This should have been fixed in the C standard yesterday (or rather yesterdecade), but alas, it didn’t happen until a few weeks ago, and that was only because JeanHeyd Meneide fought for this with the fervor of a million suns for 5 years. Despite all efforts, it unfortunately missed the recent C23 standard, but at least it’s accepted now, which opens the door for implementation in glibc.

Of course, a common response is “just use UTF-8 everywhere, dude”, and that works in many places, but the ecosystem is vast, and unfortunately that advice doesn’t apply everywhere by a long shot.

Speaking of the wider computing ecosystem, a lot of modern infrastructure is built on top of derivatives of RHEL (CentOS, Alma, Rocky, etc.), because it has proven to be the best baseline w.r.t. longevity, ABI stability and an up-to-date toolchain (which is a hugely non-trivial effort, and thanks to all involved there!).

This is especially true for anyone needing to do binary distribution. Concrete examples I’m involved with are manylinux (underlying the main binary distribution format for Python packages) and conda-forge (which also has RHEL-derived infrastructure), though I know that other ecosystems have likewise learned from manylinux (or independently came to the same conclusions).

Because glibc is so central to the (OS’s) ABI that it effectively becomes the “clock” measuring the age of any given distribution, and because RHEL is by far the longest-lived and has the most built on top of it, progress in the ecosystem is effectively discretized by the RHEL lifecycle.

This is because the available glibc features are effectively determined by what the infrastructure baseline offers (which is in turn what package authors will generally target), and this only makes a leap when said infrastructure jumps from one ancient RHEL version to a slightly-less-ancient one (for example, only once RHEL 7 is EOL, glibc features from >2.17,<=2.28 become broadly usable).
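To give a rough idea of what that looks like in practice, here’s a hypothetical sketch of the kind of feature guard portable code ends up carrying today (the statx example is just illustrative):

```c
#include <stdio.h>   /* on glibc, this also pulls in <features.h> and its version macros */

int main(void) {
    /* Hypothetical sketch: a package targeting a RHEL-derived baseline can only
       use newer glibc interfaces unconditionally once the oldest supported
       baseline ships them; until then they hide behind guards like this. */
#if defined(__GLIBC__) && (__GLIBC__ > 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ >= 28))
    /* glibc >= 2.28 (RHEL 8 era): interfaces added after 2.17,
       e.g. the statx() wrapper, are fair game. */
    puts("glibc 2.28+ baseline");
#else
    /* e.g. RHEL 7's glibc 2.17: fall back to older interfaces or do without. */
    puts("older glibc baseline");
#endif
    return 0;
}
```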

In other words, if the functionality from N3366 doesn’t make it into RHEL 10, that equates to losing roughly another 3-5 years until those features can be used broadly (i.e. only once RHEL 10 goes EOL in 10+ years, rather than “merely” once RHEL 9 does).

This is – in short – why the glibc version that ends up in RHEL has a huge impact on the (lack of) quality of our digital lives – people not being able to enter their names correctly, corrupted files and outputs, and so much more. And the problem is that the timescales involved in actually fixing these things are colossal, so losing another few years would be a Real Bummer™.

Now for the inevitable snag: this isn’t implemented in glibc yet. The good news is that glibc is generally very quick to support freshly-standardized features, and JeanHeyd himself (who I’m in loose correspondence with) is planning to get this into glibc 2.41, which is expected in early 2025. This obviously depends on the collaboration and review of the glibc folks, but since much of this is already implemented, I’m hoping-slash-assuming that this will not be the crux of the issue.

So, what I’m looking for here is: input on whether this is at all feasible, support/opposition/discussion of the subject, or sharing this with people who are involved in and/or likely affected by it.

Thank you for your time :pray:


  1. if you squint a bit, you could call it a distribution without the OS bits. ↩︎

  2. if you’re interested in the kind of hijinks I’m involved with, I wrote a blog post about one story that seemed particularly worth telling: https://labs.quansight.org/blog/building-scipy-with-flang ↩︎


In general, it’s probably better to raise issues like this on centos-devel, file RFEs on issues.redhat.com, or perhaps go through Red Hat Customer Support or Red Hat Partner Connect.

I don’t think there are plans to implement the conversion functions described in N3366. We already have the iconv family of functions (N3366 seems to confuse the POSIX-defined interface with one particular implementation, GNU libiconv, which is generally not used on GNU/Linux and does not even share code with glibc). The glibc iconv implementation has problems, and addressing those should probably have higher priority because iconv is actually used today (but of course existing users also mean that making changes to the implementation is more difficult).
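For readers who haven’t used it, this is roughly what the existing POSIX interface looks like (a sketch; the encodings and error handling are just illustrative):

```c
#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Open a conversion descriptor: convert from ISO-8859-1 to UTF-8. */
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    char in[] = "caf\xe9";                /* "café" encoded in ISO-8859-1 */
    char out[32] = {0};
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof(out) - 1;

    /* iconv() advances the pointers and decrements the counts as it goes,
       so it can be called repeatedly as more input or output space arrives. */
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        perror("iconv");

    printf("%s\n", out);                  /* "café", now in UTF-8 */
    iconv_close(cd);
    return 0;
}
```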

N3366 is a bit awkward because it does not offer an end-of-stream indicator in the normative specification of the conversion functions (and the start-of-stream indicator is rather implicit, too). Such an indicator is required because character set conversions in general aren’t homomorphisms, in the sense that E(ab) is not always equal to E(a)E(b). ICU solves this with an explicit flush parameter; see the ucnv_fromUnicode documentation. The N3366 behavior appears to be to flush on every call, which means that for stateful encodings, the result of the encoding procedure depends on the buffer sizes involved (larger buffers resulting in fewer flushes).
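To make the flush point concrete, this is roughly how it looks with the ICU C API (a sketch; error handling is abbreviated and the choice of ISO-2022-JP is just illustrative):

```c
#include <unicode/ucnv.h>   /* ICU4C; link with -licuuc */
#include <stdio.h>

int main(void) {
    UErrorCode status = U_ZERO_ERROR;
    /* ISO-2022-JP is stateful: the converter has to emit escape sequences to
       switch character sets and to shift back to ASCII at the end. */
    UConverter *cnv = ucnv_open("ISO-2022-JP", &status);
    if (U_FAILURE(status)) return 1;

    const UChar chunk1[] = { 0x65E5, 0x672C };   /* 日本 */
    const UChar chunk2[] = { 0x8A9E };           /* 語 */
    char out[64];
    char *target = out;
    const char *target_limit = out + sizeof out;

    /* Intermediate chunk: flush = 0, the converter keeps its shift state. */
    const UChar *src = chunk1;
    ucnv_fromUnicode(cnv, &target, target_limit, &src, chunk1 + 2,
                     NULL, /* flush */ 0, &status);

    /* Final chunk: flush = 1 is the explicit end-of-stream indicator, so the
       trailing shift-back-to-ASCII sequence is emitted exactly once. */
    src = chunk2;
    ucnv_fromUnicode(cnv, &target, target_limit, &src, chunk2 + 1,
                     NULL, /* flush */ 1, &status);

    printf("produced %d bytes\n", (int)(target - out));
    ucnv_close(cnv);
    return U_FAILURE(status) ? 1 : 0;
}
```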

In general, we expect that serious use of Unicode will have to use a library like ICU anyway, because character set conversion is just a minor aspect of it. Usually, you also need to identify grapheme cluster boundaries, line breaks, canonical character decompositions, and other aspects. All that goes way beyond identifying which multibyte sequence corresponds to which Unicode codepoint. Based on what’s in CentOS Stream 10 today, RHEL 10 will ship with ICU supported for application use.
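As a small illustration of the difference (again a sketch, using the ICU C break-iteration API):

```c
#include <unicode/ubrk.h>   /* ICU4C break iteration; link with -licuuc */
#include <stdio.h>

int main(void) {
    UErrorCode status = U_ZERO_ERROR;
    /* "é" written as 'e' + U+0301 (combining acute): two codepoints,
       but a single user-perceived character (grapheme cluster). */
    const UChar text[] = { 0x0065, 0x0301 };

    UBreakIterator *bi = ubrk_open(UBRK_CHARACTER, "en", text, 2, &status);
    if (U_FAILURE(status)) return 1;

    int clusters = 0;
    ubrk_first(bi);
    while (ubrk_next(bi) != UBRK_DONE)
        clusters++;

    printf("codepoints: 2, grapheme clusters: %d\n", clusters);   /* prints 1 */
    ubrk_close(bi);
    return 0;
}
```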


Thanks for the response! Happy to take this anywhere that’s more appropriate.

I don’t think that’s a fair assumption. The author extensively compared APIs, has been publicly speaking on this topic for years at various C/C++ conferences, and has a working implementation that is used commercially. These kinds of questions were why it took so long to get it through the C standards committee.

Indeed, this is part of the core design requirements of the whole effort, which is where the “multibyte” naming and the mbstate_t parameter come from (and it’s worse than that, because the existing standard functions couldn’t handle certain cases). In particular, stateful encodings are explicitly in scope, and a number of them are available in the reference implementation.
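To illustrate the restartable pattern this builds on (using today’s standard mbrtowc rather than the new N3366 functions, and assuming a UTF-8 locale is in effect):

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
    setlocale(LC_ALL, "");   /* assumes the environment selects a UTF-8 locale */

    /* The two bytes of UTF-8 "é" arrive in separate buffers, e.g. from a
       network read. The conversion state lives in the mbstate_t, so the
       second call can pick up exactly where the first one left off. */
    const char part1[] = "\xC3";
    const char part2[] = "\xA9";
    mbstate_t st = {0};
    wchar_t wc;

    if (mbrtowc(&wc, part1, 1, &st) == (size_t)-2)
        puts("first call: incomplete sequence, state kept in mbstate_t");

    if (mbrtowc(&wc, part2, 1, &st) == 1)
        printf("second call completed it: U+%04X\n", (unsigned)wc);

    return 0;
}
```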

It’s true that many more things are required, but C has had such a broken foundation that it has stunted the growth of sustainable, performant, cross-platform solutions. FWIW, all of the things you mention are also in scope for the larger std::text effort in C++ by the same author, which is what started this multi-year foray into C in the first place (and which, despite being slimmed down, spent an eternity in the committee).

Finally, ICU and iconv have several issues of their own that prevent them from solving the problem comprehensively. This from-the-ground-up effort is IMO a major evolution in the capabilities of the substrate that permeates computing everywhere, and has a better shot than anything pre-existing at solving these issues at scale. Of course, predictions are hard, especially about the future. :wink: