LLM Providers

UGENT supports multiple LLM providers simultaneously. Each provider is configured as a named instance with its own model, API key, and transport settings.

Basic Setup

toml
[llm]
default_instance = "opus"

[llm.instances.opus]
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"
max_tokens = 200000
temperature = 0.3

The default_instance key names the instance UGENT uses unless another one is selected.

Supported Providers

| Type | Provider | Example |
|---|---|---|
| openai | OpenAI | GPT-5.5, GPT-5.4-mini |
| anthropic | Anthropic | Claude Opus 4.8, Sonnet, Haiku |
| google | Google AI | Gemini 3.5 Pro/Flash |
| dashscope | Alibaba Cloud | Qwen 3.7 Max |
| openrouter | OpenRouter | Any model via unified API |
| ollama | Ollama (local) | Llama 4, DeepSeek R2 |
| openai-compatible | Any OpenAI-compatible endpoint | DeepSeek, vLLM, LM Studio |

Token Limits

toml
[llm.instances.opus]
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"
max_tokens = 200000           # max output tokens the provider allows
context_window = 200000       # total context window (input + output)
temperature = 0.3

max_tokens (Output Limit)

The maximum number of tokens the model can generate in a single response. Set to 0 for auto mode — UGENT calculates a sensible default based on the provider and whether reasoning is enabled.

When max_tokens = 0 and reasoning is enabled, UGENT applies a floor of 16384 output tokens so that both reasoning tokens and visible output fit.
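As a sketch of auto mode, the fragment below leaves the output limit to UGENT and turns reasoning on, which is the case where the 16384 floor applies. The instance name auto is hypothetical; every key is one already shown on this page.

```toml
[llm.instances.auto]              # hypothetical instance name
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"
max_tokens = 0                    # auto mode: UGENT picks the output limit

[llm.instances.auto.reasoning]
mode = "enabled"                  # with max_tokens = 0, the floor is 16384
```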

context_window (Total Context)

The total token budget (input + output) the model accepts. UGENT uses this to calculate compression thresholds and projection budgets. If omitted, it defaults to the max_tokens value.

When the conversation approaches context_window, UGENT triggers automatic context compression to summarize older messages and keep the conversation going.
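Because an omitted context_window falls back to max_tokens, an instance with a deliberately small output limit would presumably also get a small context budget, and compression would start early. A sketch of setting both explicitly follows; the instance name chat and the 8192 value are illustrative, not recommendations.

```toml
[llm.instances.chat]              # hypothetical instance name
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"
max_tokens = 8192                 # illustrative output cap
context_window = 200000           # set explicitly; if omitted it would default to 8192
```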

Multiple Instances

toml
[llm.instances.opus]
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"

[llm.instances.fast]
type = "openai"
api_key = "$OPENAI_API_KEY"
default_model = "gpt-5.4-mini"

[llm.instances.local]
type = "ollama"
base_url = "http://localhost:11434"
default_model = "llama4:70b"

Switch instances at runtime with /model, or let the orchestrator pick the best one per task.

Failover

Chain fallback instances so UGENT automatically retries on a backup when the primary fails:

toml
[llm.instances.opus]
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"
fallback_instances = ["fast", "local"]

When the primary instance exhausts all transport retries, UGENT tries each fallback in order.
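A fuller sketch of the same chain, with the fallback instances defined in the same file. It assumes that the names in fallback_instances refer to keys under [llm.instances], as the instance definitions in the Multiple Instances section suggest.

```toml
[llm.instances.opus]
type = "anthropic"
api_key = "$ANTHROPIC_API_KEY"
default_model = "claude-opus-4-8"
fallback_instances = ["fast", "local"]   # tried in this order

[llm.instances.fast]                     # first fallback
type = "openai"
api_key = "$OPENAI_API_KEY"
default_model = "gpt-5.4-mini"

[llm.instances.local]                    # second fallback
type = "ollama"
base_url = "http://localhost:11434"
default_model = "llama4:70b"
```

With this layout, a request that fails on opus would be retried on fast first, and on local only if fast also fails.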

Reasoning (Extended Thinking)

Reasoning settings apply to models that support extended thinking (Claude, Gemini, OpenAI o-series, DeepSeek).

Modes

toml
[llm.instances.opus.reasoning]
mode = "enabled"             # disabled | enabled | adaptive
effort = "high"              # minimal | low | medium | high | xhigh | max
budget_tokens = 100000       # Anthropic/Gemini only

| Mode | Behavior |
|---|---|
| disabled | No extended thinking. Standard completions. |
| enabled | Always think before responding. Reasoning tokens are included in the response. |
| adaptive | Let the provider decide when to think. Only Anthropic and Google support this natively; OpenAI/OpenAI-compatible falls back to enabled. |
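A short sketch of adaptive mode on two provider types. It assumes an instance named gemini with type = "google" and the fast instance (type = "openai") from the Multiple Instances section are defined elsewhere in the file.

```toml
[llm.instances.gemini.reasoning]
mode = "adaptive"            # Google supports adaptive natively

[llm.instances.fast.reasoning]
mode = "adaptive"            # OpenAI: falls back to "enabled"
effort = "medium"
```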

Effort Levels

| Effort | Description |
|---|---|
| minimal | Briefest reasoning, fastest response |
| low | Light reasoning |
| medium | Balanced |
| high | Deep reasoning (recommended for complex tasks) |
| xhigh | Extra-deep reasoning |
| max | Maximum reasoning budget |

Provider Differences

Anthropic uses native thinking blocks with budget_tokens:

toml
[llm.instances.opus.reasoning]
mode = "enabled"
budget_tokens = 100000       # token budget for thinking

OpenAI uses reasoning_effort in the API request:

toml
[llm.instances.reasoning-pro.reasoning]
mode = "enabled"
effort = "high"
# No budget_tokens — OpenAI manages the budget internally

Google (Gemini) uses generationConfig.thinkingConfig:

toml
[llm.instances.gemini.reasoning]
mode = "enabled"
budget_tokens = 24576         # Gemini thinking budget

OpenAI-compatible (Zhipu, DeepSeek, etc.) may use vendor-specific extensions:

toml
[llm.instances.deepseek.reasoning]
mode = "enabled"
effort = "high"

[llm.instances.deepseek.reasoning.provider.openai_compatible.extra_body]
enable_thinking = true         # DeepSeek V4 thinking toggle

The reasoning output is preserved across turns and provider switches — each thinking block is tagged with its origin provider so context is never lost.

Prompt Caching

Prompt caching is automatic. UGENT places cache breakpoints on the stable system prompt prefix and injects dynamic context (memory, code context, date) after the breakpoint — so cached tokens survive across turns.

toml
[llm.instances.opus.cache]
auto = true                   # auto-activate provider-appropriate strategy

Provider-specific cache telemetry is shown in the TUI status bar.

Transport Reliability

All providers share a unified transport layer with:

  • Bounded HTTP retries with exponential backoff + jitter
  • Stream idle watchdog (configurable timeout)
  • Retry-After header support
  • Separate interactive and background profiles

toml
[llm.transport]
request_timeout_secs = 120
stream_idle_timeout_secs = 120
http_max_retries = 3
stream_max_retries = 2
initial_backoff_ms = 1000
max_backoff_ms = 30000
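To make the interaction between initial_backoff_ms and max_backoff_ms concrete, here is a worked sketch. It assumes the delay doubles on each retry; this page only says the backoff is exponential, so the factor of 2 is an assumption, and jitter (whose form is not specified here) is left out.

```latex
% Assumed schedule: delay before retry n (n = 0, 1, 2, ...), jitter omitted
d_n = \min\left(\text{max\_backoff\_ms},\ \text{initial\_backoff\_ms} \cdot 2^{n}\right)

% With initial_backoff_ms = 1000, max_backoff_ms = 30000, http_max_retries = 3:
d_0 = 1000\ \text{ms}, \quad d_1 = 2000\ \text{ms}, \quad d_2 = 4000\ \text{ms}
```

Under this assumption the 30000 ms cap would only take effect from the sixth retry onward (1000 · 2^5 = 32000), so with three retries it never applies.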

Per-instance overrides:

toml
[llm.transport_instance_overrides.fast]
request_timeout_secs = 30

Sub-Agent Cost Control

Reserve expensive models for the parent agent and let sub-agents use cheaper ones:

toml
[llm.instances.opus]
type = "anthropic"
allow_subagent_use = false    # reserved for parent only

[llm.instances.fast]
type = "openai"
default_model = "gpt-5.4-mini"
allow_subagent_use = true     # sub-agents may use this

When any instance opts in with allow_subagent_use = true, delegated sub-agents may only use the opted-in instances. If the orchestrator omits the provider target, UGENT substitutes the first approved instance, so the worker never silently inherits the parent's expensive default.

An optional models_allowlist further restricts which model names a sub-agent may select:

toml
[llm.instances.fast]
allow_subagent_use = true
models_allowlist = ["gpt-5.4-mini", "gpt-5.5"]

Per-User Model Selection (Web Channel)

For multi-user web deployments, each user can choose their own model:

toml
[llm.instances.fast]
type = "openai"
default_model = "gpt-5.4-mini"
allow_user_select = true

When any instance opts in, only opted-in instances are selectable by web users. Selection is default-deny: an instance is user-selectable only when it explicitly sets allow_user_select = true. Operators can toggle this at runtime with /providers policy <instance> user-select on|off.

Multimodal Input

Images, audio, video, and documents are supported per provider capability:

toml
[llm.instances.vision]
type = "openai"
api_key = "$OPENAI_API_KEY"
default_model = "gpt-5.5"

[llm.instances.vision.multimodal]
inline_max_bytes = 4194304    # 4 MiB
max_parts_interactive = 10
fallback_mode = "error"       # error | fallback

The build-time model catalog auto-detects which modalities each model supports.

Released under the Private Beta License.