Swival: A Coding Agent Built to Not Break on Small Models


What Is Swival

Swival is a CLI coding agent built by Frank Denis (jedisct1, creator of libsodium). Pure Python, no framework. 575 commits, actively maintained.

The tagline: “A coding agent for any model.”

Most coding agents — Claude Code, Codex, Hermes, Aider — are built and tested against frontier models with 128K+ context windows. They assume clean tool calls, reliable instruction following, and plenty of room. When those assumptions break, the agent falls apart in ways that look like the model’s fault but are really the agent’s.

Swival takes the opposite approach. It assumes the model will have tight limits and rough edges, then does the extra work to keep the task moving anyway.

Source: github.com/swival/swival

Provider Support

Swival connects to eight provider types out of the box:

Provider | Auth | Zero config
--- | --- | ---
LM Studio | none | auto-discovers loaded model
HuggingFace Inference API | HF_TOKEN | requires --model
OpenRouter | OPENROUTER_API_KEY | requires --model
Google Gemini | GEMINI_API_KEY | requires --model
ChatGPT Plus/Pro | browser OAuth on first run | requires --model
AWS Bedrock | AWS credential chain | requires --model
Generic (OpenAI-compat) | optional OPENAI_API_KEY | requires --model + --base-url
Command (external program) | none | requires --model as command string

The generic provider covers ollama, llama.cpp, mlx_lm.server, vLLM, DeepSeek API, and anything that speaks the OpenAI chat completions protocol. The command provider can even wrap codex exec or custom scripts as a backend.

Swival’s userbase skews toward smaller and local models:

Local (primary targets):

  • Qwen3-Coder-Next — top recommendation, great quality/speed on consumer hardware
  • Qwen3.5-35B-A3B — via vLLM or HuggingFace Endpoints
  • Gemma 4 26B A4B — via llama.cpp (GGUF)
  • Qwen3-Coder-480B-A35B-4bit — via mlx_lm.server
  • Qwen3 32B — via ollama

Cloud:

  • GLM-5 via HuggingFace or OpenRouter (note: z.ai’s non-standard URL path /api/coding/paas/v4 is incompatible — Swival auto-appends /v1 to the base URL)
  • Gemini 2.5 Flash via Google
  • GPT-5.4 via ChatGPT Plus/Pro (no API key needed — browser OAuth)
  • Claude Opus 4.6 via AWS Bedrock
  • DeepSeek Chat via DeepSeek API
  • Qwen3-Coder-480B via NVIDIA NIM (generic provider — rate limit issues, see below)

The heavy Qwen3 bias makes sense — Qwen3 has strong tool-calling at small quantized sizes, which is exactly where Swival shines.

Installation

uv (Python 3.13+):

uv tool install swival
uv tool upgrade swival

Homebrew (macOS):

brew install swival/tap/swival
brew upgrade swival

No npm, no Docker, no binary download. Pure Python, minimal dependencies.

Quick Start

With LM Studio (zero config)

  1. Install LM Studio and load a model with tool-calling support
  2. Start the LM Studio server
  3. Run:
swival "Refactor the error handling in src/api.py"

Swival auto-discovers the loaded model and connects. No flags needed.

With HuggingFace

export HF_TOKEN=hf_...
swival --provider huggingface --model zai-org/GLM-5 "Fix the bug in auth.ts"

With local llama.cpp

# Start llama-server first
llama-server --reasoning auto --fit on \
-hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL
# Then connect
swival --provider llamacpp "Add input validation to signup endpoint"

With NVIDIA NIM

NVIDIA NIM exposes an OpenAI-compatible API. Point Swival’s generic provider at it:

export OPENAI_API_KEY="nvapi-..."
swival --provider generic \
--base-url https://integrate.api.nvidia.com/v1 \
--model qwen/qwen3-coder-480b-a35b-instruct \
"Add input validation to the signup endpoint"

Or pass the key inline with --api-key:

swival --provider generic \
--base-url https://integrate.api.nvidia.com/v1 \
--api-key "nvapi-..." \
--model qwen/qwen3-coder-480b-a35b-instruct \
"task"

Tested: NIM works via the generic provider. Note that the generic provider reads OPENAI_API_KEY for authentication, not NVIDIA_API_KEY.

Known Incompatibility: z.ai

z.ai uses a non-standard API path (/api/coding/paas/v4) that doesn’t follow the OpenAI convention. Swival’s generic provider auto-appends /v1 to the base URL, which produces an incorrect endpoint. As of this writing, z.ai cannot be used with Swival.
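The failure mode is easy to reproduce. Assuming Swival unconditionally appends /v1 to whatever base URL it is given (a reconstruction of the behavior described above, not its actual code), the non-standard path ends up with a bogus suffix:

```python
# Hypothetical reconstruction of the base-URL handling described above;
# Swival's real normalization logic may differ in detail.
def normalize_base_url(base_url: str) -> str:
    # Assumed behavior: append /v1 unless the URL already ends with it.
    base = base_url.rstrip("/")
    if not base.endswith("/v1"):
        base += "/v1"
    return base

# An OpenAI-convention host gets the expected endpoint:
assert normalize_base_url("https://integrate.api.nvidia.com") == \
    "https://integrate.api.nvidia.com/v1"

# A z.ai-style non-standard path picks up a bogus trailing /v1
# (host is illustrative):
assert normalize_base_url("https://z.example/api/coding/paas/v4").endswith("/v4/v1")
```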

NIM: Rate Limit Issues

NVIDIA NIM technically works via the generic provider, but Swival’s agent loop makes multiple API calls per task (tool calls, compaction, retries). This quickly exhausts NIM’s free tier rate quota. Unless you have a paid NIM deployment or generous quota, it’s not practical for real use.

Generic Provider Caveat: openai/ Prefix

Swival’s generic provider automatically prepends openai/ to the model name. So if you pass --model glm-5-turbo, Swival sends openai/glm-5-turbo to the API. This is intentional for routing within Swival’s internal logic, but it breaks cloud providers that don’t recognize the openai/ prefix — making agentic tool calling unusable for many cloud LLMs. If your provider rejects prefixed model names, there’s no workaround as of this writing.

With generic OpenAI-compatible server

swival --provider generic \
--base-url http://127.0.0.1:8080 \
--model my-model \
"task description"

Interactive REPL

swival

The REPL carries conversation history across turns, which makes it well suited to exploratory work.

Stdin piping

swival -q < objective.md
cat prompts/review.md | swival --provider huggingface --model zai-org/GLM-5

CLI-native design: stdout is exclusively the final answer, all diagnostics go to stderr. Pipe output straight into files or other commands.

Deep Dive: Context Management

This is where Swival diverges most from other agents. The entire system is designed around one reality: small models have tight context windows, and context management is the agent’s job, not the model’s.

The Four-Level Compaction Pipeline

When the context window fills up, Swival runs a graduated compaction pipeline:

  1. Shrink old tool results — truncate large file reads and command outputs from earlier turns
  2. Drop low-value turns — score each turn by importance, drop the least valuable ones
  3. Nuclear drop — keep only the last two turns
  4. Shed tool schemas — remove tool definitions entirely, relying on the model’s prior knowledge

Each level only fires if the previous one wasn’t enough. This prevents over-aggressive compaction on tasks that don’t need it.
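As a rough illustration, the graduated pipeline can be sketched in a few lines of Python. All names, limits, and heuristics below are assumptions for the sketch, not Swival's actual implementation:

```python
# Illustrative sketch of a graduated compaction pipeline; names, limits,
# and scoring heuristics are assumptions, not Swival's code.

def estimate_tokens(messages):
    # Crude estimate: roughly 4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def shrink_old_tool_results(messages, keep_last=2, cap=200):
    # Level 1: truncate large tool outputs from earlier turns.
    for m in messages[:-keep_last]:
        if m["role"] == "tool" and len(m["content"]) > cap:
            m["content"] = m["content"][:cap] + "[truncated]"
    return messages

def drop_low_value_turns(messages, keep=4):
    # Level 2: keep only the most important turns (recency stands in
    # for a real importance score here).
    return messages[-keep:]

def nuclear_drop(messages):
    # Level 3: keep only the last two turns.
    return messages[-2:]

def compact(messages, tools, budget):
    # Each level fires only if the previous one was not enough.
    for level in (shrink_old_tool_results, drop_low_value_turns, nuclear_drop):
        if estimate_tokens(messages) <= budget:
            return messages, tools
        messages = level(messages)
    if estimate_tokens(messages) > budget:
        tools = []  # Level 4: shed tool schemas entirely.
    return messages, tools
```

A task that fits the budget passes through untouched; an oversized history gets progressively cut down until only the last two turns, and eventually no tool schemas, remain.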

Knowledge That Survives Compaction

This is the single most important design decision. Three things live outside the message history:

  • Thinking notes — a think tool gives the model a structured scratchpad
  • Todo checklist — a todo tool tracks work items
  • Snapshot summaries — a snapshot tool lets the agent compress its investigation into a summary

Even after the most aggressive compaction (level 4), the agent still has its reasoning chain, task list, and accumulated knowledge. It can lose every message and keep working.

Bounded Tool Output

Hard limits prevent single tool calls from blowing the context budget:

  • File reads capped at 50KB
  • Grep returns at most 100 matches
  • Command output over 10KB saved to temp file, replaced with a pointer
  • MCP tool schemas that would eat more than half the context window are dropped at startup
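A minimal sketch of this kind of bounding, using the limits from the list above (function names and the pointer format are illustrative assumptions):

```python
import tempfile

# Illustrative sketch of tool-output bounding; the limits match the list
# above, but function names and the pointer format are assumptions.
MAX_FILE_READ = 50 * 1024    # file reads capped at 50KB
MAX_CMD_OUTPUT = 10 * 1024   # larger command output is spilled to disk

def bound_file_read(data: str) -> str:
    # Truncate oversized file contents instead of flooding the context.
    if len(data) > MAX_FILE_READ:
        return data[:MAX_FILE_READ] + "\n[truncated: file exceeds 50KB]"
    return data

def bound_command_output(output: str) -> str:
    # Save large output to a temp file and hand the model a pointer
    # plus a small preview.
    if len(output) > MAX_CMD_OUTPUT:
        with tempfile.NamedTemporaryFile(
            mode="w", suffix=".log", delete=False
        ) as f:
            f.write(output)
            return f"[full output saved to {f.name}]\n" + output[:1024]
    return output
```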

Forgiving Parsers

  • Tool-call parsing uses multi-pass recovery — if JSON is slightly broken, Swival tries to fix it
  • The edit tool uses three-pass matching: exact, line-trimmed, unicode-normalized
  • These add up to fewer stalled loops with smaller models

Error Guardrails

  • Same error twice: warn the model
  • Same error three times: tell it to stop and try something different
  • Prevents small models from burning their entire context budget on a loop

Other Features

Review loop and LLM-as-judge. Configurable review loop that runs external reviewer scripts or uses a built-in LLM-as-judge to evaluate and retry output. Good for quality assurance on critical tasks.

Benchmarking reports. --report report.json writes machine-readable evaluation data: per-call LLM timing, tool success/failure counts, context compaction events, and guardrail interventions.

Secret encryption. --encrypt-secrets transparently detects API keys in LLM messages and encrypts them before they leave your machine. The LLM never sees real values. Decryption happens locally when responses come back.
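Swival encrypts the detected values; the sketch below substitutes opaque placeholders instead, just to show the round-trip shape. The detection pattern and placeholder format are assumptions:

```python
import re
import secrets

# Illustrative sketch: Swival encrypts detected keys; this version uses
# placeholders to show the round trip. The pattern and placeholder
# format are assumptions, not Swival's.
KEY_PATTERN = re.compile(r"\b(?:sk|hf|nvapi)[-_][A-Za-z0-9_-]{16,}\b")

def mask_secrets(text: str, vault: dict[str, str]) -> str:
    # Replace each detected key with a placeholder; the real value stays
    # in a local vault and never reaches the LLM.
    def repl(match: re.Match) -> str:
        token = f"SECRET_{secrets.token_hex(4)}"
        vault[token] = match.group(0)
        return token
    return KEY_PATTERN.sub(repl, text)

def unmask_secrets(text: str, vault: dict[str, str]) -> str:
    # Restore real values locally when the response comes back.
    for token, value in vault.items():
        text = text.replace(token, value)
    return text
```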

Cross-session memory. Stores notes in a local memory file, retrieves relevant entries using BM25 ranking. Use /learn in the REPL to teach it something on the spot.
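BM25 itself is compact enough to sketch inline. This is a generic implementation of the ranking formula, not Swival's memory code:

```python
import math
from collections import Counter

def bm25_rank(query: str, docs: list[str],
              k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Return document indices ranked by BM25 score against the query.
    Generic illustration; tokenization here is plain whitespace split."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / N
    df: Counter[str] = Counter()   # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    # Highest-scoring notes first.
    return sorted(range(N), key=lambda i: -scores[i])
```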

Session resume. When interrupted (Ctrl+C, max turns, context overflow), state saves to disk. Next run in the same directory picks up where it left off.

A2A server mode. swival --serve makes the agent an Agent-to-Agent endpoint other agents can call over HTTP. Multi-turn context, streaming, rate limiting, bearer auth built in.

Skills, MCP, and A2A. SKILL.md-based skills for reusable workflows, Model Context Protocol for external tools, Agent-to-Agent protocol for remote agent communication.

Prompt caching. Automatically marks the system message as cacheable for providers that support it (Anthropic, Gemini, Bedrock). Typically saves 30-60% of input token costs.

Swival vs Hermes Agent

Feature | Swival | Hermes Agent
--- | --- | ---
Language | Pure Python | Python + Node.js (browser tool)
Install | uv tool install swival | Git clone + setup
Primary target | Small/local models (8B-35B, 16K-32K ctx) | Frontier models (128K+ ctx)
Default provider | LM Studio (auto-discover) | Configurable (Anthropic, OpenAI, etc.)
Context compaction | 4-level graduated pipeline | Auto compression
Durable state | think/todo/snapshot survive compaction | Memory tool (DB-backed)
Skills | SKILL.md-based | SKILL.md-based
MCP | Yes | Yes
A2A | Built-in server mode (--serve) | Via gateway/platforms
Memory | BM25-ranked local file | SQLite FTS5 + fact_store
Secret encryption | Built-in (--encrypt-secrets) | No
Benchmarking | --report report.json with telemetry | No built-in
Review loop | External scripts + LLM-as-judge | No
ChatGPT Plus/Pro | Browser OAuth (no API key) | No
AWS Bedrock | Native provider | Via OpenAI-compatible
Command provider | Wrap any CLI as backend | ACP adapter (Claude Code, etc.)
Platforms | CLI only | CLI + Telegram, Discord, Slack, WhatsApp, Signal, Home Assistant
Subagents | No | Yes (delegate_task)
Background processes | No | Yes
Cron jobs | No | Yes
Browser automation | Via MCP (Chrome DevTools, Lightpanda, agent-browser) | Built-in browser tool
Code execution | Command tool | Sandbox execute_code
CLI design | stdout = answer only, stderr = diagnostics | Rich terminal UI

When to use Swival:

  • Running local models with tight context windows
  • You want zero-config with LM Studio
  • Benchmarking model performance on coding tasks
  • You need secret encryption for API keys
  • CLI piping workflows (swival -q < task.md | jq .)
  • Using ChatGPT Plus/Pro without a separate API key

When to use Hermes:

  • You need multi-platform delivery (Telegram, Discord, etc.)
  • Subagent orchestration and parallel workstreams
  • Background processes and cron scheduling
  • Built-in browser automation
  • Richer tool ecosystem (3000+ tests, vision, sandbox execution)

They’re complementary more than competing. Swival excels at the “run a local model on a coding task and get reliable output” problem. Hermes is a full agent platform with messaging, scheduling, and multi-agent coordination.

Profile Configuration

If you switch between providers regularly, use profiles:

~/.config/swival/config.toml
[profiles.local]
provider = "lmstudio"
model = "qwen3-coder-next"
[profiles.gpt5]
provider = "chatgpt"
model = "gpt-5.4"
reasoning_effort = "high"
[profiles.hf]
provider = "huggingface"
model = "zai-org/GLM-5"

swival --profile local "quick task"
swival --profile gpt5 "hard task"

Switch mid-session in the REPL:

swival> /profile gpt5

Final Thoughts

Swival fills a real gap for local coding. Its entire architecture — 4-level compaction, durable state, bounded outputs, forgiving parsers — is built around the constraint that context is scarce and model outputs are imperfect. This is exactly the reality when running 8B-35B models locally.

Where Swival falls short is with cloud API providers. In our testing:

  • z.ai — incompatible. Non-standard URL path (/api/coding/paas/v4) breaks Swival’s auto-appended /v1.
  • NVIDIA NIM — technically works, but Swival’s multi-turn agent loop burns through NIM’s rate quota fast. Not practical on free tiers.
  • Cloud LLMs via generic provider — Swival auto-prepends openai/ to model names, breaking providers that don’t recognize the prefix. Agentic tool calling becomes unusable.

The common thread: Swival’s generic provider isn’t truly generic. Between the forced openai/ prefix, the auto-appended /v1 on URLs, and the chatty agent loop, cloud API usage is fragile at best. This is fine when inference is local and free (LM Studio, llama.cpp, ollama). It becomes a problem when you’re paying per token or have rate limits.

Verdict: Swival is best suited for local coding with local models. If you have a GPU and run Qwen3, Gemma, or similar models locally, Swival is purpose-built for that workflow — zero config with LM Studio, auto-discovery, and context management that keeps small models on track. For cloud-based coding with API providers, agents like Hermes, Claude Code, or Codex are better suited due to their optimized prompt caching, fewer round-trips, and provider-native integrations.

At ~45 GitHub stars and 575 commits from a respected security engineer, it’s early but clearly serious. If you run local models for coding tasks and find other agents unreliable, Swival is worth a try.

Links: