F44 Change Proposal: ibus-speech-to-text pywhispercpp [SelfContained]

ibus-speech-to-text pywhispercpp

Wiki

Announced

This is a proposed Change for Fedora Linux.
This document represents a proposed Change. As part of the Changes process, proposals are publicly announced in order to receive community feedback. This proposal will only be implemented if approved by the Fedora Engineering Steering Committee.

Summary

ibus-speech-to-text 0.7.0 introduces support for OpenAI’s Whisper engine via pywhispercpp (Python bindings for WhisperCpp), in addition to the existing Vosk engine.

Owner

Detailed Description

Key ibus-speech-to-text-0.7.0 Changes:

  • ibus-speech-to-text provides a new backend engine option, allowing users to select between the Vosk and Whisper engines
  • It has a new GStreamer engine to integrate WhisperCpp into the ibus-speech-to-text pipeline
  • It supports multiple Whisper models, including locally installed models and online models downloaded from Hugging Face
  • Automatic locale-based model selection when possible
  • UI updates to allow backend switching and model management from the setup tool

Feedback

Benefit to Fedora

This package will bring several benefits to Fedora:

  • Higher accuracy speech recognition
  • Greater flexibility by allowing users to choose between multiple backends

Scope

Upgrade/compatibility impact

Existing ibus-speech-to-text installations will continue to use the Vosk backend by default. No existing configuration or functionality is removed.

Early Testing (Optional)

Do you require ‘QA Blueprint’ support? N

How To Test

=== Functionality Test ===

  1. Install required packages: sudo dnf install ibus-speech-to-text

  2. Restart IBus using the ibus restart command

  3. Add Speech To Text in input sources

  4. Launch the IBus STT Setup tool from the preferences to configure the engine and download a language model

  5. From the Setup tool, select Whisper as the backend, then choose and download a Whisper model from the list of available models for each locale

User Experience

Users will see a new backend option in ibus-speech-to-text settings with a variety of Whisper models.

Dependencies

  • pywhispercpp

Contingency Plan

  • Contingency mechanism: N/A (Not a system wide change)
  • Contingency deadline: N/A (Not a system wide change)
  • Blocks release? N/A (Not a system wide change)

Documentation

N/A (Not a system wide change)

Release Notes

ibus-speech-to-text now supports the WhisperCpp speech recognition engine via pywhispercpp, providing improved accuracy and multilingual support.

Last edited by @alking 2026-01-12T17:03:12Z


How do you feel about the proposal as written?

  • Strongly in favor
  • In favor, with reservations
  • Neutral
  • Opposed, but could be convinced
  • Strongly opposed

If you are in favor but have reservations, or are opposed but something could change your mind, please explain in a reply.

We want everyone to be heard, but many posts repeating the same thing actually makes that harder. If you have something new to say, please say it. If, instead, you find someone has already covered what you’d like to express, please simply give that post a :heart: instead of reiterating. You can even do this by email, by replying with the heart emoji or just “+1”. This will make long topics easier to follow.

Please note that this is an advisory “straw poll” meant to gauge sentiment. It isn’t a vote or a scientific survey. See About the Change Proposals category for more about the Change Process and moderation policy.

This change proposal has now been submitted to FESCo with ticket #3556 for voting.

To find out more, please visit our Changes Policy documentation.

I had some comments on this. First off, a robust STT tool is really needed on Linux as voice becomes an integral part of communicating with AI systems. Handy.computer tries to do this cross-platform but fails on Fedora/Wayland (keys don’t register). So overall, this IBus-STT work is going in the right direction for F44.

Concerns:

  1. Upstream abandoned? sudo dnf info ibus-speech-to-text shows
Available packages
Name           : ibus-speech-to-text
Epoch          : 0
Version        : 0.6.0
Release        : 1.fc43
Architecture   : noarch
Download size  : 76.3 KiB
Installed size : 282.2 KiB
Source         : ibus-speech-to-text-0.6.0-1.fc43.src.rpm
Repository     : fedora
Summary        : A speech to text IBus Input Method using VOSK
URL            : https://github.com/Manish7093/IBus-Speech-To-Text
License        : GPL-3.0-or-later
Description    : A speech to text IBus Input Method using VOSK,
               : which can be used to dictate text to any application
Vendor         : Fedora Project

with https://github.com/Manish7093/IBus-Speech-To-Text as the project URL. The original upstream is at https://github.com/PhilippeRo/IBus-Speech-To-Text. Manish7093’s changes are merge-ready into PhilippeRo’s repo, BUT it’s not clear if the original author maintains the project anymore. The last update was 4 yrs ago and an earlier merge request by Manish has gone unanswered. I’ve pinged him.

  2. If upstream is indeed abandoned and the vendor is the Fedora Project, should this fold into a Fedora-owned repo (vs. Manish7093’s own repo)? SEO/Google still leads to PhilippeRo’s repo.

  3. Enabling issues on IBus-Speech-To-Text: Regardless of where the live project lives, can we please have issues enabled? Why?
    A) Whisper is nice, but it’s already old thanks to nvidia/parakeet-v3 (leaderboard at https://huggingface.co/spaces/hf-audio/open_asr_leaderboard, with low WER and high RTFx, i.e. small size, very low word error rate, and very much real-time performance)

B) I’m seeing IBus-STT stability issues with Whisper, especially on large/large-turbo. The system is powerful (30-core Strix Halo with 128 GB RAM), so something underneath is weird. I also noticed it struggled with quantized GGUFs, so I’m not sure if the whisper.cpp in use is old.

I was going to jump in just for fun but noticed the repo issue and SEO leading to the older site, so I wanted to sync here since this is for release in Fedora 44. I think this should make it into F44, but let’s tidy up the other parts for community efforts.

Hi Sid,
Thanks for the review! Let me try to address your points one by one.

Upstream Status

The original upstream (PhilippeRo) has been inactive for the past few years. After packaging ibus-speech-to-text for Fedora and starting development work, I reached out to Philippe by email to coordinate and check on upstream direction. Unfortunately, I didn’t receive any response. To avoid blocking the Fedora work and to keep maintenance moving forward, I continued development in my fork and changed the source in the ibus-speech-to-text spec file to point to my fork.

That said, I agree that clarity here is important. I’m happy to:

  1. Continue maintaining and developing in the fork for now, and contribute the work back upstream if Philippe resumes development. OR
  2. Discuss moving the active development to a Fedora-owned namespace if that’s the preferred and cleaner long-term path.

I don’t have a strong preference either way; the goal is sustainability and transparency for Fedora users and contributors.

Issues disabled
That’s a fair call-out. I have enabled issues.

Speech Models

Whisper was chosen primarily because it’s already familiar to many users and has solid model support. That said, this integration is explicitly backend-pluggable; nothing here locks ibus-stt to WhisperCpp long term. If better-performing engines (like parakeet-v3 or others) become available, we can certainly start working on supporting them.

Stability issues

Thanks for highlighting the stability issues you’re seeing; I will start investigating.


If I understand correctly, this functionality relies on Whisper models (such as the ones at ggml-org/whisper.cpp / openai/whisper). According to https://github.com/openai/whisper#license the models are published under the MIT license.

Will those models be packaged as well for Fedora or would each user have to download them manually?

Hi @siosm ,
No, these models will not be packaged; users will need to download the required models manually.

What are the hardware requirements for this? I’m concerned that while these Whisper models may indeed provide more reliable results, the cost in hardware time might be prohibitive. For example, is it reliant on the presence of a supported GPU? If so, what models and how much vRAM are required? Can it run in CPU-only mode, and if so, what’s the performance penalty? An STT engine that takes 30 seconds or more to understand you wouldn’t be of much use.

I also can’t tell from the description if the expectation is that just having ibus-stt installed will prefer the Whisper models by default or if it will require specific configuration to enable.

Hi @sgallagh ,

  • Currently it runs fully in CPU-only mode; the tiny and base Whisper models operate close to real time, while larger models are slower and can introduce some delay.
  • Installing ibus-stt alone does not automatically force Whisper usage; it is provided as an optional backend speech engine alongside Vosk. Users can select between Whisper and Vosk from its setup tool.

Hi @matiwari ! Thanks for your work on this package. This came up in the FESCo meeting today and we had some follow up questions:

  1. It looks like these are the whisper models you are targeting:

    'tiny': 'ggml-tiny.bin',
    'tiny.en': 'ggml-tiny.en.bin',
    'base': 'ggml-base.bin',
    'base.en': 'ggml-base.en.bin',
    'small': 'ggml-small.bin',
    'small.en': 'ggml-small.en.bin',
    'medium': 'ggml-medium.bin',
    'medium.en': 'ggml-medium.en.bin',
    'large-v1': 'ggml-large-v1.bin',
    'large-v2': 'ggml-large-v2.bin',
    'large-v3': 'ggml-large-v3.bin'

It looks like the tiny/base ones can work on CPU on 2 GB RAM systems at reasonable speed, at the cost of accuracy, up to the large-* models, which are ~3 GB in size, require 4-5 GB of RAM, and are much slower but more accurate.

Do you have an idea of the accuracy / quality / HW requirements of, say, tiny or base compared to Vosk? Vosk appears to be packaged, and I hear there is no intention at this time to package any of the Whisper models.

Can the tiny/base ones work in a restricted environment? Would it make sense to package one of them for inclusion so it could be used, say, during the installation phase?
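As a rough sketch of how a setup tool might turn those model names into downloads: the filename table below is copied from the snippet quoted above, but the Hugging Face URL pattern is my assumption (based on the ggml-org/whisper.cpp repository), not a confirmed detail of the package:

```python
# Sketch: map a Whisper model name to its ggml file and a plausible
# download URL. Filenames come from the list quoted above; the URL
# pattern is an assumption, not necessarily what ibus-stt uses.

GGML_FILES = {
    'tiny': 'ggml-tiny.bin',
    'tiny.en': 'ggml-tiny.en.bin',
    'base': 'ggml-base.bin',
    'base.en': 'ggml-base.en.bin',
    'small': 'ggml-small.bin',
    'small.en': 'ggml-small.en.bin',
    'medium': 'ggml-medium.bin',
    'medium.en': 'ggml-medium.en.bin',
    'large-v1': 'ggml-large-v1.bin',
    'large-v2': 'ggml-large-v2.bin',
    'large-v3': 'ggml-large-v3.bin',
}

# Assumed mirror; the actual download location may differ.
HF_REPO = "https://huggingface.co/ggml-org/whisper.cpp/resolve/main"


def model_url(name: str) -> str:
    """Return the (assumed) download URL for a known model name."""
    try:
        return f"{HF_REPO}/{GGML_FILES[name]}"
    except KeyError:
        raise ValueError(f"unknown whisper model: {name!r}") from None
```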

  2. Along the same lines - is it possible to ship just one model? If Whisper is better quality than Vosk, do you see Whisper becoming the default?

  3. The user experience appears to be that the user needs to install this package, go through the configuration setup, and download the model on the first run - is that expected? The primary goal here is improving accessibility, is that correct? How does this align with other upstream projects looking at improving Linux desktop accessibility?

  4. Do we lose anything with this change? Would anything break for current users?

Manish is away for a couple of weeks, so let me try to respond on his behalf.

I think that is correct: the two smallest models seem to work well for English without acceleration. (After that it gets quite heavy: Manish told me it might be possible to enable ROCm, say, but that would be extra work and complexity right now.)

I think it is better (slightly smarter) than Vosk (for English).

But I think there may be a misunderstanding: the Vosk models also need to be downloaded - there is no difference in this regard.

That is an interesting idea; I think it might be possible to package. In principle Fedora allows inclusion of such free content, but we would need to check more on the licensing of the models, etc. It is true that having a pre-installed model would improve the UX.

(btw the model sizes shown in the UI are the on-disk size not the download size - this might also be a point of confusion or something to clarify in the UI)

There are still some limitations: for example, the Whisper backend samples audio in 5 s chunks, I think, whereas Vosk can handle more continuous transcription.
I hope some of them can be lifted or improved in the future.
Perhaps down the road we can consider making it the default if people prefer it.
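The 5 s chunking mentioned above can be sketched in pure Python as follows (illustrative only: fixed windows over 16 kHz mono samples; the real GStreamer pipeline is more involved, and the chunk length is as reported in this thread, not verified against the code):

```python
# Illustrative only: split a stream of 16 kHz mono samples into
# fixed 5-second chunks, the way the Whisper backend reportedly
# buffers audio (vs. Vosk's more continuous streaming).

SAMPLE_RATE = 16_000        # whisper.cpp expects 16 kHz input
CHUNK_SECONDS = 5           # chunk length reported in this thread
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def chunk_audio(samples: list[float]) -> list[list[float]]:
    """Split samples into successive 5 s windows; keep the final partial window."""
    return [samples[i:i + CHUNK_SAMPLES]
            for i in range(0, len(samples), CHUNK_SAMPLES)]
```

Each chunk would be handed to the recognizer independently, which is why words spanning a chunk boundary can be a weak spot of this approach compared to continuous streaming.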

That’s right

I am not sure it is really suitable as an a11y tool tbh, at least at this point, but there is potential I suppose: we would need more feedback from a11y users. At least for visual impairment I think typically they need a screen-reader and tend to find this kind of voice input too awkward and slow still, but it might be applicable to other use-cases perhaps.
And to be fair I am not sure using Input Methods is ideal for a11y in general: ideally speech recognition should be more integrated into the desktop if possible, like onscreen keyboards are, probably. But it is a step forward anyway, and perhaps we can collaborate on improving the a11y experience.

I still consider this Input Method to be somewhat experimental, or perhaps an interesting toy in some sense - or, better said, a showcase or tech preview of what is possible. But maybe someone wants to try integrating pywhispercpp into some other application(s), like for education or a simple a11y UI perhaps.

Anyway it is testable today in F43 and F44 - so one can try it: perhaps you already did by the sound of it, thanks for the thoughtful questions.

Nope, it is a net-new feature; Vosk is still available as the default backend.

Manish: 2. Discuss moving the active development to a Fedora-owned namespace if that’s the preferred and cleaner long-term path.

I think it’s cleaner this folds into the Fedora namespace.

The primary goal here is improving accessibility, is that correct? How does this align with other upstream projects looking at improving Linux desktop accessibility?

As the voice of the customer, I’d say that’s true in retrospect but let’s be forward looking for a moment. When communicating with continuous-loop AI agents, voice is a viable high-bandwidth alternative to keyboard, even for folks not in the accessibility camp. I’m currently using nvidia parakeet on my mac (then text SSH’d to Fedora) and it’s incredible. It would be good for this to be a 1st class experience within Fedora. Once you have the core plumbing setup thru ibus-stt (vosk or whisper), additional models (parakeet or future) fall into a similar pattern.

Moving on, there should be sane OOB defaults. If picking tiny or small, it’s best to pick the English-only models. Medium and above perform well multilingually (enough bits to go around). I didn’t find THAT much of a difference across tiny vs large, probably because even small parakeet (CPU; 600 MB; parakeet-tdt-0.6b-v3) is so incredibly impressive over even Whisper large (3 GB).

Finally, the metadata around model types needs to improve. Everything is “Lightweight model for Android and RPi”, which is clearly not the case.

There is also a bug in the current Whisper integration where at times the buffer doesn’t get processed in time. STT starts live, but at some point no text comes out until you kill the process, and suddenly the buffered text pops out.

Edit: typos

@sid4728: I appreciate the sentiment about the project hosting, but I don’t think this project is actually Fedora-specific, so like the rest of the ibus engines, hosting it upstream on GitHub seems fine in my view; if there is enough interest it could be moved to an org, perhaps.

There are certainly still some bugs lurking, and I feel the setup UI could be improved and simplified. Help fixing and improving it is very welcome too. Can you open a bug or issue about the descriptions of the models, please? Filing more, if you wish, would help get them attention.

So it may be a good idea to package the tiny or base model for Fedora, perhaps.

With the speed of AI, I guess Whisper is already considered somewhat dated. Parakeet sounds pretty cool - it seems to be CC-BY-4.0, iiuc. There are also other interesting models, like Mistral Voxtral and many others sprouting up (though some may have higher hardware requirements). I think it would be great if there were some generic abstraction layer to make it easier to try out these different models; currently it seems like quite some work to plumb in the different models.