Does Ramalama make AI boring?

Following a discussion we had today I came up with a short blog about ramalama.

What is ramalama?

Ramalama is simply a command line that runs AI models locally by treating them like containers.
Ramalama uses podman or docker to run containers.
It is CPU/GPU optimized and it accelerates performance.

Running AI models using ramalama.

When running the AI models, ramalama uses different transport registries which include ollama, huggingface, Modelscope and OCI registries.

Ollama is the easiest to use. Hugging face and modelscope are not as complicated but OCI registries require authentication in order to use them. For example, ghcr.io which is one of the OCI registries require an authentication token from github to run.
By Njeri Kimaru

Model hallucinations and temperature control.

Different models have different specs and the lighter models seem to hallucinate more than the heavier ones. Some of the lighter models will actually provide no information.

Here’s a light weight model I prompted the four foundations of Fedora;

One way of reducing hallucinations is using --temp 0 flag to make the model deterministic and reduce hallucinations. Temperature control tag ranges from 0 to 1, with lower values increasing model determinism. Uses the 0 as at temp=0, there’s no randomness given the same input, you always get the same output. That’s what makes it deterministic.

For example;

The Merlinite model (4.07GB) looped and hallucinated when asked about Fedora RPM packaging role in Fedora but it provided detailed answers with reference links when it used temperature control.
By Utkarsh_Mishra

Conclusion

While Ollama is praised for ease of use, RamaLama was built as an alternative that allows developers to run and serve AI models while making it easy to put those models in containers and enable local, collaborative, and production benefits. Red Hat

Do you think Ramalama makes AI boring??

4 Likes

Amazing write up

2 Likes

Thanks for sharing, @iamnjeri. My suggestion is that you please reference your work in the text, as well as that of Utkarsh, where you talk about the temp flag he found helpful (–temp 0) and why the 0, and what it does by expanding on it.

1 Like

Thanks @gtfrans2re I will update that.

1 Like

Thanks @farhana

1 Like

Great work @iamnjeri on this one!

I also noticed something similar while testing.

Smaller models were giving answers that sounded confident but were actually wrong, especially for specific questions like the Fedora foundations. Using --temp 0 definitely helped make the output more stable.

But with bigger models like gemma3:12b, the problem wasn’t hallucination as much as it was load time and performance. ramalama run kept timing out, but switching to ramalama serve actually worked better.

It’s kinda interesting how the problem changes depending on the model, smaller ones struggle with accuracy, bigger ones struggle with resources.

It feels less like “using AI” and more like you’re controlling it directly. So yeah, not boring… just a different kind of fun!