LLM Providers
UGENT supports multiple LLM providers simultaneously. Each provider is configured as a named instance with its own model, API key, and transport settings.
Basic Setup
[llm]
default_instance = "opus"
[llm.instances.opus]
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"
max_tokens = 200000
temperature = 0.3
The default_instance determines which provider UGENT uses unless overridden.
Supported Providers
| Type | Provider | Example |
|---|---|---|
| openai | OpenAI | GPT-5.5, GPT-5.4-mini |
| anthropic | Anthropic | Claude Opus 4.8, Sonnet, Haiku |
| google | Google AI | Gemini 3.5 Pro/Flash |
| dashscope | Alibaba Cloud | Qwen 3.7 Max |
| openrouter | OpenRouter | Any model via unified API |
| ollama | Ollama (local) | Llama 4, DeepSeek R2 |
| openai-compatible | Any OpenAI-compatible endpoint | DeepSeek, vLLM, LM Studio |
Token Limits
[llm.instances.opus]
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"
max_tokens = 200000 # max output tokens the provider allows
context_window = 200000 # total context window (input + output)
temperature = 0.3
max_tokens (Output Limit)
The maximum number of tokens the model can generate in a single response. Set to 0 for auto mode — UGENT calculates a sensible default based on the provider and whether reasoning is enabled.
When max_tokens = 0 and reasoning is enabled, UGENT uses a floor of 16384 to ensure both reasoning tokens and visible output fit.
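The auto-mode rule can be sketched as a small function. This is only an illustration under stated assumptions, not UGENT's implementation: `resolve_max_tokens` and `provider_default` are hypothetical names, and the only value taken from the text is the 16384 floor.

```python
REASONING_FLOOR = 16384  # floor stated in the text for auto mode with reasoning


def resolve_max_tokens(configured: int, provider_default: int, reasoning_enabled: bool) -> int:
    """Return the output-token limit to request (hypothetical helper)."""
    if configured != 0:
        # An explicit max_tokens value is used as-is.
        return configured
    if reasoning_enabled:
        # Auto mode with reasoning: never drop below the floor, so that
        # reasoning tokens and visible output both fit.
        return max(provider_default, REASONING_FLOOR)
    # Auto mode without reasoning: assumed to use a per-provider default.
    return provider_default
```

With a hypothetical provider default of 4096, auto mode yields 16384 when reasoning is on and 4096 when it is off.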
context_window (Total Context)
The total token budget (input + output) the model accepts. UGENT uses this to calculate compression thresholds and projection budgets. If omitted, it defaults to the max_tokens value.
When the conversation approaches context_window, UGENT triggers automatic context compression to summarize older messages and keep the conversation going.
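A minimal sketch of the two rules above: the context_window default and the compression trigger. The 80% threshold is an assumption chosen for illustration (the text only says "approaches"), and both function names are hypothetical.

```python
from typing import Optional


def effective_context_window(max_tokens: int, context_window: Optional[int]) -> int:
    """context_window falls back to max_tokens when omitted, as described above."""
    return context_window if context_window is not None else max_tokens


def should_compress(used_tokens: int, window: int, threshold_pct: int = 80) -> bool:
    """Return True once usage reaches threshold_pct of the window.

    The 80% default is an assumed value, not a documented UGENT setting.
    Integer arithmetic avoids floating-point edge cases at the boundary.
    """
    return used_tokens * 100 >= window * threshold_pct
```

One consequence of the default is worth noting: if context_window is omitted and max_tokens is small, the compression threshold is derived from that small value, so setting context_window explicitly is the safer choice.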
Multiple Instances
[llm.instances.opus]
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"
[llm.instances.fast]
type = "openai"
api_key = "$OPENAI_API_KEY"
default_model = "gpt-5.4-mini"
[llm.instances.local]
type = "ollama"
base_url = "http://localhost:11434"
default_model = "llama4:70b"
Switch at runtime with /model or let the orchestrator pick the best one per task.
Failover
Chain fallback instances so UGENT automatically retries on a backup when the primary fails:
[llm.instances.opus]
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"
fallback_instances = ["fast", "local"]
When the primary instance exhausts all transport retries, UGENT tries each fallback in order.
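The ordering can be sketched as follows. This is an illustration, not UGENT's code: `InstanceFailed`, `call_with_failover`, and the `send` callback are hypothetical, and the sketch assumes only the primary's own fallback list is consulted (fallbacks of fallbacks are not followed).

```python
from typing import Callable, Dict, List, Optional


class InstanceFailed(Exception):
    """Hypothetical error: an instance exhausted its transport retries."""


def call_with_failover(
    primary: str,
    fallbacks: Dict[str, List[str]],
    send: Callable[[str], str],
) -> str:
    """Try the primary, then each configured fallback in list order."""
    last_error: Optional[InstanceFailed] = None
    for name in [primary, *fallbacks.get(primary, [])]:
        try:
            return send(name)
        except InstanceFailed as err:
            # Remember the failure and move on to the next instance.
            last_error = err
    raise InstanceFailed(f"all instances failed; last error: {last_error}")
```

With the configuration above, a failing opus and fast would result in attempts on opus, fast, and local, in that order.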
Reasoning (Extended Thinking)
Reasoning settings apply to models that support extended thinking (Claude, Gemini, OpenAI o-series, DeepSeek).
Modes
[llm.instances.opus.reasoning]
mode = "enabled" # disabled | enabled | adaptive
effort = "high" # minimal | low | medium | high | xhigh | max
budget_tokens = 100000 # Anthropic/Gemini only
| Mode | Behavior |
|---|---|
| disabled | No extended thinking. Standard completions. |
| enabled | Always think before responding. Reasoning tokens are included in the response. |
| adaptive | Let the provider decide when to think. Only Anthropic and Google support this natively; OpenAI/OpenAI-compatible falls back to enabled. |
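The adaptive fallback in the table can be sketched as a lookup. The function name is hypothetical, and treating every provider type other than anthropic and google as non-native is an assumption; the table only names OpenAI and OpenAI-compatible explicitly.

```python
NATIVE_ADAPTIVE = {"anthropic", "google"}  # per the table above
VALID_MODES = {"disabled", "enabled", "adaptive"}


def effective_mode(provider_type: str, mode: str) -> str:
    """Return the reasoning mode that would actually be applied (sketch)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown reasoning mode: {mode!r}")
    if mode == "adaptive" and provider_type not in NATIVE_ADAPTIVE:
        # No native adaptive support: fall back to always-on thinking.
        return "enabled"
    return mode
```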
Effort Levels
| Effort | Description |
|---|---|
| minimal | Briefest reasoning, fastest response |
| low | Light reasoning |
| medium | Balanced |
| high | Deep reasoning (recommended for complex tasks) |
| xhigh | Extra-deep reasoning |
| max | Maximum reasoning budget |
Provider Differences
Anthropic uses native thinking blocks with budget_tokens:
[llm.instances.opus.reasoning]
mode = "enabled"
budget_tokens = 100000 # token budget for thinking
OpenAI uses reasoning_effort in the API request:
[llm.instances.reasoning-pro.reasoning]
mode = "enabled"
effort = "high"
# No budget_tokens: OpenAI manages the budget internally
Google (Gemini) uses generationConfig.thinkingConfig:
[llm.instances.gemini.reasoning]
mode = "enabled"
budget_tokens = 24576 # Gemini thinking budget
OpenAI-compatible (Zhipu, DeepSeek, etc.) may use vendor-specific extensions:
[llm.instances.deepseek.reasoning]
mode = "enabled"
effort = "high"
[llm.instances.deepseek.reasoning.provider.openai_compatible.extra_body]
enable_thinking = true # DeepSeek V4 thinking toggle
The reasoning output is preserved across turns and provider switches. Each thinking block is tagged with its origin provider so context is never lost.
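To make the provider differences concrete, the sketch below maps a parsed reasoning table to the request fields each provider's public API uses: Anthropic's `thinking` object, OpenAI's `reasoning_effort`, and Gemini's `generationConfig.thinkingConfig.thinkingBudget`. The function itself is hypothetical, it covers only mode = "enabled", and how UGENT translates effort values such as xhigh or max for a given provider is not specified here, so the effort value is passed through unchanged as an assumption.

```python
def reasoning_fields(provider_type: str, reasoning: dict) -> dict:
    """Build the provider-specific request fragment for reasoning (sketch)."""
    mode = reasoning.get("mode", "disabled")
    if mode == "disabled":
        return {}
    if mode != "enabled":
        raise NotImplementedError("this sketch only covers mode = 'enabled'")

    if provider_type == "anthropic":
        return {"thinking": {"type": "enabled",
                             "budget_tokens": reasoning["budget_tokens"]}}
    if provider_type == "openai":
        return {"reasoning_effort": reasoning["effort"]}
    if provider_type == "google":
        return {"generationConfig": {
            "thinkingConfig": {"thinkingBudget": reasoning["budget_tokens"]}}}
    if provider_type == "openai-compatible":
        # Assumption: send reasoning_effort and merge the vendor extra_body
        # keys into the top level of the request body.
        extra = (reasoning.get("provider", {})
                          .get("openai_compatible", {})
                          .get("extra_body", {}))
        return {"reasoning_effort": reasoning["effort"], **extra}
    raise ValueError(f"unsupported provider type: {provider_type!r}")
```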
Prompt Caching
Prompt caching is automatic. UGENT places cache breakpoints on the stable system prompt prefix and injects dynamic context (memory, code context, date) after the breakpoint — so cached tokens survive across turns.
[llm.instances.opus.cache]
auto = true # auto-activate provider-appropriate strategy
Provider-specific cache telemetry is shown in the TUI status bar.
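The breakpoint placement described above can be illustrated with Anthropic's system-block format, where a `cache_control` marker caches everything up to and including the marked block. This is a sketch of the idea only; `build_system_blocks` is a hypothetical helper, and whether UGENT assembles its requests exactly this way is an assumption.

```python
def build_system_blocks(stable_prompt: str, dynamic_context: str) -> list:
    """Place the cache breakpoint on the stable prefix, dynamic text after it."""
    blocks = [{
        "type": "text",
        "text": stable_prompt,
        # Breakpoint: the prefix up to here is cacheable across turns.
        "cache_control": {"type": "ephemeral"},
    }]
    if dynamic_context:
        # Memory, code context, and the date change between turns, so they
        # go after the breakpoint and do not invalidate the cached prefix.
        blocks.append({"type": "text", "text": dynamic_context})
    return blocks
```

Because the dynamic block comes last, changing it between turns leaves the cached prefix byte-identical, which is what lets the cache hit.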
Transport Reliability
All providers share a unified transport layer with:
- Bounded HTTP retries with exponential backoff + jitter
- Stream idle watchdog (configurable timeout)
- Retry-After header support
- Separate interactive and background profiles
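A minimal sketch of exponential backoff with jitter and Retry-After handling. The defaults mirror the `initial_backoff_ms` and `max_backoff_ms` values in the `[llm.transport]` example; the "full jitter" variant, the choice to honor Retry-After without capping it, and the function name are all assumptions rather than documented UGENT behavior.

```python
import random
from typing import Callable, Optional


def backoff_delay_ms(
    attempt: int,
    initial_ms: int = 1000,
    max_ms: int = 30000,
    retry_after_secs: Optional[float] = None,
    rng: Callable[[], float] = random.random,
) -> int:
    """Delay before retry number `attempt` (0-based), in milliseconds."""
    if retry_after_secs is not None:
        # The server told us how long to wait; use that instead of backoff.
        return int(retry_after_secs * 1000)
    # Exponential growth, bounded by max_ms.
    ceiling = min(max_ms, initial_ms * 2 ** attempt)
    # Full jitter: pick uniformly in [0, ceiling) to spread out retries.
    return int(rng() * ceiling)
```

The `rng` parameter exists only so the jitter can be made deterministic in tests.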
[llm.transport]
request_timeout_secs = 120
stream_idle_timeout_secs = 120
http_max_retries = 3
stream_max_retries = 2
initial_backoff_ms = 1000
max_backoff_ms = 30000
Per-instance overrides:
[llm.transport_instance_overrides.fast]
request_timeout_secs = 30
Sub-Agent Cost Control
Reserve expensive models for the parent agent and let sub-agents use cheaper ones:
[llm.instances.opus]
type = "anthropic"
allow_subagent_use = false # reserved for parent only
[llm.instances.fast]
type = "openai"
default_model = "gpt-5.4-mini"
allow_subagent_use = true # sub-agents may use this
When any instance opts in with allow_subagent_use = true, delegated sub-agents may only use the opted-in instances. If the orchestrator omits the provider target, it auto-substitutes the first approved instance so the worker never silently inherits the parent's expensive default.
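The selection rule can be sketched as below. Several details are assumptions made for illustration: that "first approved" means first in configuration order, that requesting a non-approved instance is rejected with an error, and that no restriction applies when no instance opts in. `subagent_instance` is a hypothetical name.

```python
from typing import Dict, Optional


def subagent_instance(
    instances: Dict[str, dict],
    requested: Optional[str],
    parent_default: str,
) -> str:
    """Pick the LLM instance a delegated sub-agent should use (sketch)."""
    # Dicts preserve insertion order, so this follows configuration order.
    approved = [name for name, cfg in instances.items()
                if cfg.get("allow_subagent_use") is True]
    if not approved:
        # Assumption: with no opt-ins, the policy is not active.
        return requested or parent_default
    if requested is None:
        # Orchestrator omitted the target: substitute the first approved one.
        return approved[0]
    if requested not in approved:
        raise PermissionError(f"instance {requested!r} is not approved for sub-agents")
    return requested
```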
An optional models_allowlist further restricts which model names a sub-agent may select:
[llm.instances.fast]
allow_subagent_use = true
models_allowlist = ["gpt-5.4-mini", "gpt-5.5"]
Per-User Model Selection (Web Channel)
For multi-user web deployments, each user can choose their own model:
[llm.instances.fast]
type = "openai"
default_model = "gpt-5.4-mini"
allow_user_select = true
When any instance opts in, only opted-in instances are selectable by web users. This is default-deny: an instance is user-selectable only when it explicitly sets allow_user_select = true. Operators can toggle at runtime with /providers policy <instance> user-select on|off.
Multimodal Input
Images, audio, video, and documents are supported per provider capability:
[llm.instances.vision]
type = "openai"
api_key = "$OPENAI_API_KEY"
default_model = "gpt-5.5"
[llm.instances.vision.multimodal]
inline_max_bytes = 4194304
max_parts_interactive = 10
fallback_mode = "error" # error | fallback
The build-time model catalog auto-detects which modalities each model supports.