LLM Providers

UGENT supports multiple LLM providers simultaneously. Each provider is configured as a named instance with its own model, API key, and transport settings.

Basic Setup

toml

[llm]
default_instance = "opus"

[llm.instances.opus]
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"
max_tokens = 200000
temperature = 0.3

The default_instance setting names the instance UGENT uses when nothing else selects one, such as a runtime /model switch or the orchestrator's per-task choice.

Supported Providers

Type	Provider	Example
`openai`	OpenAI	GPT-5.5, GPT-5.4-mini
`anthropic`	Anthropic	Claude Opus 4.8, Sonnet, Haiku
`google`	Google AI	Gemini 3.5 Pro/Flash
`dashscope`	Alibaba Cloud	Qwen 3.7 Max
`openrouter`	OpenRouter	Any model via unified API
`ollama`	Ollama (local)	Llama 4, DeepSeek R2
`openai-compatible`	Any OpenAI-compatible endpoint	DeepSeek, vLLM, LM Studio

Token Limits

toml

[llm.instances.opus]
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"
max_tokens = 200000           # max output tokens the provider allows
context_window = 200000       # total context window (input + output)
temperature = 0.3

`max_tokens` (Output Limit)

The maximum number of tokens the model can generate in a single response. Set it to 0 for auto mode, in which UGENT calculates a sensible default based on the provider and whether reasoning is enabled.

When max_tokens = 0 and reasoning is enabled, UGENT uses a floor of 16384 to ensure both reasoning tokens and visible output fit.
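
As a minimal sketch of auto mode, the fragment below combines max_tokens = 0 with a reasoning table. It uses only keys that appear elsewhere on this page; the instance name and effort value are illustrative.

```toml
[llm.instances.opus]
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"
max_tokens = 0                # auto mode: UGENT picks the output limit

[llm.instances.opus.reasoning]
mode = "enabled"              # with reasoning on, auto mode uses a floor of 16384
effort = "high"
```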

`context_window` (Total Context)

The total token budget (input + output) the model accepts. UGENT uses this to calculate compression thresholds and projection budgets. If omitted, it defaults to the max_tokens value.

When the conversation approaches context_window, UGENT triggers automatic context compression to summarize older messages and keep the conversation going.
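
Because context_window falls back to max_tokens when omitted, an instance with a deliberately small output limit would presumably also get a small compression threshold. The sketch below sets both values explicitly. The numbers are illustrative, not published limits for this model.

```toml
[llm.instances.opus]
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"
max_tokens = 32000            # illustrative output cap, smaller than the window
context_window = 200000       # set explicitly so it does not default to 32000
```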

Multiple Instances

toml

[llm.instances.opus]
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"

[llm.instances.fast]
type = "openai"
api_key = "$OPENAI_API_KEY"
default_model = "gpt-5.4-mini"

[llm.instances.local]
type = "ollama"
base_url = "http://localhost:11434"
default_model = "llama4:70b"

Switch instances at runtime with /model, or let the orchestrator pick the best one per task.

Failover

Chain fallback instances so UGENT automatically retries on a backup when the primary fails:

toml

[llm.instances.opus]
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"
fallback_instances = ["fast", "local"]

When the primary instance exhausts all transport retries, UGENT tries each fallback in order.
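
For the chain to resolve, every name in fallback_instances presumably has to match a defined instance. The fuller sketch below combines the instances from the Multiple Instances section with the fallback list. This page does not say whether fallbacks chain transitively, so only the primary lists them here.

```toml
[llm.instances.opus]
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"
fallback_instances = ["fast", "local"]   # tried in this order after retries run out

[llm.instances.fast]
type = "openai"
api_key = "$OPENAI_API_KEY"
default_model = "gpt-5.4-mini"

[llm.instances.local]
type = "ollama"
base_url = "http://localhost:11434"
default_model = "llama4:70b"
```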

Reasoning (Extended Thinking)

Reasoning settings apply to models that support extended thinking (Claude, Gemini, OpenAI o-series, DeepSeek) and are configured per instance.

Modes

toml

[llm.instances.opus.reasoning]
mode = "enabled"             # disabled | enabled | adaptive
effort = "high"              # minimal | low | medium | high | xhigh | max
budget_tokens = 100000       # Anthropic/Gemini only

Mode	Behavior
`disabled`	No extended thinking. Standard completions.
`enabled`	Always think before responding. Reasoning tokens are included in the response.
`adaptive`	Let the provider decide when to think. Only Anthropic and Google support this natively; OpenAI and OpenAI-compatible providers fall back to `enabled`.
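
To illustrate the adaptive row, the sketch below sets the same mode on an Anthropic instance and an OpenAI instance. The instance names and effort values are illustrative, and the comments restate the fallback rule from the table rather than observed behavior.

```toml
[llm.instances.opus.reasoning]
mode = "adaptive"             # Anthropic: the provider decides when to think
effort = "medium"

[llm.instances.fast.reasoning]
mode = "adaptive"             # OpenAI: no native support, treated as "enabled"
effort = "low"
```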

Effort Levels

Effort	Description
`minimal`	Briefest reasoning, fastest response
`low`	Light reasoning
`medium`	Balanced
`high`	Deep reasoning (recommended for complex tasks)
`xhigh`	Extra-deep reasoning
`max`	Maximum reasoning budget

Provider Differences

Anthropic uses native thinking blocks with budget_tokens:

toml

[llm.instances.opus.reasoning]
mode = "enabled"
budget_tokens = 100000       # token budget for thinking

OpenAI uses reasoning_effort in the API request:

toml

[llm.instances.reasoning-pro.reasoning]
mode = "enabled"
effort = "high"
# No budget_tokens — OpenAI manages the budget internally

Google (Gemini) uses generationConfig.thinkingConfig:

toml

[llm.instances.gemini.reasoning]
mode = "enabled"
budget_tokens = 24576         # Gemini thinking budget

OpenAI-compatible (Zhipu, DeepSeek, etc.) may use vendor-specific extensions:

toml

[llm.instances.deepseek.reasoning]
mode = "enabled"
effort = "high"

[llm.instances.deepseek.reasoning.provider.openai_compatible.extra_body]
enable_thinking = true         # DeepSeek V4 thinking toggle

Reasoning output is preserved across turns and provider switches. Each thinking block is tagged with its origin provider, so earlier reasoning stays in context after you change instances.

Prompt Caching

Prompt caching is automatic. UGENT places cache breakpoints on the stable system prompt prefix and injects dynamic context (memory, code context, date) after the breakpoint, so the cached prefix stays valid across turns.

toml

[llm.instances.opus.cache]
auto = true                   # auto-activate provider-appropriate strategy

Provider-specific cache telemetry is shown in the TUI status bar.

Transport Reliability

All providers share a unified transport layer with:

Bounded HTTP retries with exponential backoff + jitter
Stream idle watchdog (configurable timeout)
Retry-After header support
Separate interactive and background profiles

toml

[llm.transport]
request_timeout_secs = 120
stream_idle_timeout_secs = 120
http_max_retries = 3
stream_max_retries = 2
initial_backoff_ms = 1000
max_backoff_ms = 30000

Per-instance overrides:

toml

[llm.transport_instance_overrides.fast]
request_timeout_secs = 30
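
An override table presumably accepts the same keys as [llm.transport], although only request_timeout_secs is shown above. Under that assumption, a slower local instance could be given more patience on streams and fewer HTTP retries. The values are illustrative.

```toml
[llm.transport_instance_overrides.local]
stream_idle_timeout_secs = 300   # assumption: same key as in [llm.transport]
http_max_retries = 1             # assumption: same key as in [llm.transport]
```

Keys not listed in an override would then be expected to inherit the shared [llm.transport] values.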

Sub-Agent Cost Control

Reserve expensive models for the parent agent and let sub-agents use cheaper ones:

toml

[llm.instances.opus]
type = "anthropic"
allow_subagent_use = false    # reserved for parent only

[llm.instances.fast]
type = "openai"
default_model = "gpt-5.4-mini"
allow_subagent_use = true     # sub-agents may use this

When any instance opts in with allow_subagent_use = true, delegated sub-agents may use only the opted-in instances. If the orchestrator omits the provider target, UGENT substitutes the first approved instance, so the worker never silently inherits the parent's expensive default.

An optional models_allowlist further restricts which model names a sub-agent may select:

toml

[llm.instances.fast]
allow_subagent_use = true
models_allowlist = ["gpt-5.4-mini", "gpt-5.5"]

Per-User Model Selection (Web Channel)

For multi-user web deployments, each user can choose their own model:

toml

[llm.instances.fast]
type = "openai"
default_model = "gpt-5.4-mini"
allow_user_select = true

When any instance opts in, only opted-in instances are selectable by web users. The policy is default-deny: an instance is user-selectable only when it explicitly sets allow_user_select = true. Operators can toggle this at runtime with /providers policy <instance> user-select on|off.
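
The sub-agent and user-selection flags can presumably be combined on the same instances. The sketch below reserves the expensive instance for the parent agent and exposes only the cheaper one to sub-agents and web users. It uses only keys documented on this page.

```toml
[llm.instances.opus]
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"
allow_subagent_use = false    # parent agent only
# allow_user_select is not set, so web users cannot pick this instance

[llm.instances.fast]
type = "openai"
api_key = "$OPENAI_API_KEY"
default_model = "gpt-5.4-mini"
allow_subagent_use = true     # sub-agents may use this instance
allow_user_select = true      # web users may select this instance
```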

Multimodal Input

Images, audio, video, and documents are supported where the provider and model allow them:

toml

[llm.instances.vision]
type = "openai"
api_key = "$OPENAI_API_KEY"
default_model = "gpt-5.5"

[llm.instances.vision.multimodal]
inline_max_bytes = 4194304
max_parts_interactive = 10
fallback_mode = "error"       # error | fallback

The build-time model catalog auto-detects which modalities each model supports.
