Host a local model (maximum privacy)#
This guide shows how to run gmuse against a local LLM so your prompts and responses stay on your machine.
gmuse uses LiteLLM under the hood. Local hosting means you operate the model runtime (no gmuse backend servers).
If you haven’t already, read the privacy overview: Privacy & Security.
When to use a local model#
Local hosting is a good fit when you want:
Maximum privacy (no third-party API calls)
Offline usage (after the model is downloaded)
Predictable costs (no per-request billing)
It may be a poor fit when you need:
The highest quality for complex reasoning (local models may be weaker than frontier hosted models)
Very large context windows
Zero maintenance (you are responsible for updates and security)
Security checklist (read first)#
If you follow only one section, follow this one.
Bind locally by default: keep the inference server on 127.0.0.1 (localhost).
Do not expose the port publicly unless you understand the risks.
If you must access it remotely:
Put it behind authentication.
Use TLS.
Restrict access (VPN, firewall rules, allowlists).
Disable debug logging in sensitive environments:
Avoid GMUSE_DEBUG=1 and DEBUG-level logs because prompts/diffs may be written to logs.
Treat model downloads like binaries:
Prefer official sources.
Keep the runtime and models updated.
Review licensing before redistribution or commercial use.
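One way to check the "bind locally" item on Linux is to look at which address the server port is bound to. The sketch below assumes Ollama's default port 11434. The helper names is_loopback_addr and check_port_binding are hypothetical and not part of gmuse or Ollama.

```shell
# Hypothetical helper: succeeds if a listening address ("host:port") is loopback-only.
is_loopback_addr() {
  case "$1" in
    127.*|localhost:*|"[::1]:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# List listening TCP sockets on the given port (default 11434, an assumption)
# and flag any bind that is not loopback-only.
# `ss` is Linux-specific; on macOS, `lsof -nP -iTCP:11434 -sTCP:LISTEN`
# shows similar information.
check_port_binding() {
  port="${1:-11434}"
  ss -ltnH 2>/dev/null | awk '{print $4}' | grep ":${port}\$" | while read -r addr; do
    if is_loopback_addr "$addr"; then
      echo "ok: $addr is loopback-only"
    else
      echo "warning: $addr may be reachable from other hosts"
    fi
  done
}
```

Running check_port_binding with no arguments inspects port 11434. A bind such as 0.0.0.0:11434 or [::]:11434 triggers the warning. Ollama reads its bind address from the OLLAMA_HOST environment variable, so leaving it unset should keep the default loopback bind.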
Recommended “golden path”: Ollama + LiteLLM#
Ollama is a simple local LLM runtime. LiteLLM supports Ollama models via the ollama/ (or ollama_chat/) model prefix.
1) Install and run Ollama#
Follow Ollama’s official install instructions for your OS:
https://ollama.com/download
Ensure the Ollama server is running and listening on localhost (default: http://localhost:11434).
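A quick way to confirm the server is up is to probe its HTTP API. This is a minimal sketch that assumes curl is installed and that the server is at the default address. The helper name ollama_is_up is hypothetical.

```shell
# Hypothetical helper: succeeds if an Ollama server answers at the given base URL.
ollama_is_up() {
  base_url="${1:-http://localhost:11434}"
  # --fail returns non-zero on HTTP errors; --max-time avoids hanging forever.
  curl --silent --fail --max-time 3 "$base_url/api/version" >/dev/null 2>&1
}

# Usage:
#   if ollama_is_up; then echo "Ollama is reachable"; else echo "Ollama is not reachable"; fi
```

The helper also returns non-zero when curl itself is missing, so a failure does not by itself prove the server is down.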
2) Pull a model#
Pick a model that’s strong at short, structured text generation.
Examples (choose one):
$ ollama pull qwen2.5-coder
$ ollama pull llama3.1
Notes:
Smaller models (around 7B–8B) are typically a good speed/quality balance for commit messages.
Model names vary by runtime; use ollama list to see what you have locally.
3) Point gmuse at the local model#
You can configure a local model either temporarily (environment) or persistently (config file).
Option A — Environment variable (one shell session):
$ export GMUSE_MODEL="ollama/qwen2.5-coder"
$ gmuse msg
Option B — Config file (persistent):
Add to ~/.config/gmuse/config.toml:
model = "ollama/qwen2.5-coder"
Then run:
$ gmuse msg
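Option B can be scripted, for example when provisioning a new machine. The sketch below writes the single-line config shown above and refuses to overwrite an existing file. The helper name write_gmuse_config is hypothetical, and the default directory is taken from the path given in this guide.

```shell
# Hypothetical helper: create a minimal gmuse config if none exists yet.
# $1 = model name, $2 = config directory (defaults to ~/.config/gmuse).
write_gmuse_config() {
  model="$1"
  config_dir="${2:-$HOME/.config/gmuse}"
  config_file="$config_dir/config.toml"

  # Never clobber a config the user may have edited by hand.
  if [ -e "$config_file" ]; then
    echo "refusing to overwrite $config_file" >&2
    return 1
  fi
  mkdir -p "$config_dir"
  printf 'model = "%s"\n' "$model" > "$config_file"
}

# Usage:
#   write_gmuse_config "ollama/qwen2.5-coder"
```

For a one-off run that touches neither the shell session nor the config file, standard shell syntax lets you set the variable for a single command: GMUSE_MODEL="ollama/llama3.1" gmuse msg.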
4) Verify configuration#
gmuse info prints the resolved model and provider heuristics:
$ gmuse info
If your provider is detected as ollama, your GMUSE_MODEL is being interpreted as a local runtime model.
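As a cross-check that does not depend on the exact output of gmuse info, you can inspect the provider prefix of the model string yourself. This sketch only splits the string at the first slash; it does not reproduce gmuse's own detection logic, and the helper name model_provider_prefix is hypothetical.

```shell
# Hypothetical helper: print the text before the first "/" in a model name,
# or an empty line when the name has no provider prefix.
model_provider_prefix() {
  case "$1" in
    */*) printf '%s\n' "${1%%/*}" ;;
    *)   printf '\n' ;;
  esac
}

# Usage:
#   model_provider_prefix "${GMUSE_MODEL:-}"
```

With GMUSE_MODEL set as in step 3, the prefix should be ollama (or ollama_chat if you chose that variant).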
Generalize to other local backends#
Ollama is the simplest path, but it’s not the only one.
Two common patterns work well with gmuse:
Pattern A — Use a LiteLLM provider prefix#
If LiteLLM supports your local backend directly, set GMUSE_MODEL to the appropriate provider-prefixed model name.
Examples:
ollama/<model>
ollama_chat/<model> (often better chat-style responses)
See the LiteLLM providers list:
https://docs.litellm.ai/docs/providers
Pattern B — Use an OpenAI-compatible endpoint (advanced)#
Some local servers expose an OpenAI-compatible API.
Best practices for this approach:
Prefer a LiteLLM provider that matches your backend (when available) instead of the generic OpenAI-compatible route.
If your OpenAI-compatible client requires an API key even locally, use a non-sensitive placeholder and enable authentication on the server if it’s reachable beyond localhost.
LiteLLM reference:
https://docs.litellm.ai/docs/providers/openai_compatible
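A minimal sketch of Pattern B follows. It relies on LiteLLM's openai/ model prefix and its OPENAI_API_BASE and OPENAI_API_KEY environment variables. That gmuse passes these through to LiteLLM unchanged is an assumption, and the port and model name are hypothetical placeholders.

```shell
# Assumption: a local OpenAI-compatible server listens on 127.0.0.1:8080
# (hypothetical port; use the address your server actually reports).
export OPENAI_API_BASE="http://127.0.0.1:8080/v1"

# Non-sensitive placeholder; never reuse a real hosted-provider key here.
export OPENAI_API_KEY="local-placeholder"

# "local-model" is a hypothetical name; use the model id your server exposes.
export GMUSE_MODEL="openai/local-model"
```

After setting these, gmuse info is a reasonable first check of what was resolved. Keeping the base URL on 127.0.0.1 is consistent with the security checklist above.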
Troubleshooting#
gmuse says no provider is configured#
Run:
$ gmuse info
Common causes:
GMUSE_MODEL isn’t set and no provider API key env var is set.
GMUSE_MODEL is set to a value LiteLLM doesn’t recognize.
Connection errors#
If you see errors connecting to Ollama:
Verify the server is running (default address http://localhost:11434).
Confirm the model exists locally: ollama list.
Ensure you can reach the server from the same environment you run gmuse from.
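The "model exists locally" check can be automated by matching the configured model against the output of ollama list. This sketch assumes the first line of that output is a header and that the first column holds names of the form name:tag. The helper name model_in_list is hypothetical.

```shell
# Hypothetical helper: succeeds if the model named by $1 appears in
# `ollama list` output supplied on stdin. $1 is expected to carry a
# provider prefix such as "ollama/", which is stripped before matching.
model_in_list() {
  name="${1#*/}"   # drop everything up to and including the first "/"
  awk -v want="$name" '
    NR > 1 {                  # skip the header row
      full = $1               # e.g. "qwen2.5-coder:latest"
      base = full
      sub(/:.*/, "", base)    # e.g. "qwen2.5-coder"
      if (full == want || base == want) found = 1
    }
    END { exit found ? 0 : 1 }
  '
}

# Usage:
#   ollama list | model_in_list "$GMUSE_MODEL" || echo "model not pulled yet"
```

If the helper reports a missing model, pulling it as in step 2 is the usual fix.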
Output is too long or too random#
Tune these settings:
Lower temperature for more deterministic messages
Reduce max tokens for shorter outputs
Example config:
temperature = 0.2
max_tokens = 200
See also: Configure gmuse and the Configuration Reference.