TL;DR: Hermes Agent does more than call your main chat model. It also runs eight background auxiliary tasks, and if you leave them on expensive defaults, context compression alone can quietly dominate your token bill. The practical fix is to route high-frequency background work, especially compression, to a cheaper or local model while keeping the main model for user-facing reasoning.
Most people think about Hermes pricing in terms of the main chat model they selected: Claude Opus, Sonnet, GPT-5, or something local. The video from Onchain AI Garage makes a more useful point: a large share of cost can come from the models you are not thinking about at all.
Hermes has a separate auxiliary layer for background work such as compression, memory flushing, web extraction, and vision. Those tasks are small individually, but they can fire often enough that they matter more than one expensive foreground call. If you are doing long coding sessions, research-heavy browsing, or image-driven workflows, this is where optimization starts paying off.
The Eight Auxiliary Tasks Matter More Than They Look
The video frames Hermes as running eight “hidden” model slots in the background. The ones called out explicitly are:
`compression`, `flush_memories`, `web_extract`, `vision`, `session_search`, `skills_hub`, `mcp`, and `approval`.
Not all of these cost the same. The important point is not just that they exist, but that they can be configured independently from the main model. That changes the optimization problem completely. You do not need your most expensive frontier model deciding whether a command looks risky or summarizing an internal session search result.
What Each Auxiliary Task Actually Does
The names are useful once you already know Hermes internals, but they are not very friendly if you are seeing them for the first time. Here is the practical interpretation of each one.
compression
This is Hermes’ context summarizer. When a conversation gets long enough to hit the configured compression threshold, Hermes summarizes older parts of the session so it can keep working without overflowing the context window.
This is usually the biggest cost driver because it can fire repeatedly during long coding or research sessions. It does not usually need your best reasoning model. It needs a model that can handle long input cleanly and produce reliable summaries.
flush_memories
This runs when Hermes ends a session and decides what should be written into long-term memory. In other words, it looks back at the conversation and extracts durable facts, preferences, and useful outcomes.
This is more like structured extraction than deep reasoning. A fast, cheaper model is often enough as long as it is consistent.
web_extract
This is the model Hermes uses to turn fetched web pages into usable summaries or extracted text. If you use Hermes for research, documentation reading, or browsing-heavy workflows, this slot matters a lot.
This task benefits from factual preservation more than raw intelligence. A model that summarizes faithfully is usually better here than a very expensive model that is overkill for extraction.
vision
This handles image analysis: screenshots, browser screenshots, and image inputs. If Hermes is using visual tools, this is the model that needs to understand what is on the screen.
This one is different from the others because it must be multimodal. A text-only model cannot do this job. If you automate websites or inspect screenshots often, making this slot explicit is a good idea.
session_search
This is the background model used when Hermes looks through past sessions and summarizes the results. It helps answer questions like: “Have we worked on this before?” or “Which previous session is relevant?”
This is typically a summarization-and-ranking task, not a hard reasoning task. Speed is usually more important than frontier-level depth.
skills_hub
This is Hermes’ skill matcher. When Hermes sees your request, skills_hub helps determine whether one of the installed skills is relevant and which skill instructions should be loaded.
The easiest way to think about it is: skills_hub answers, “Which playbook should I use?” That means it is mostly a routing problem. In most setups, it does not need an expensive heavyweight model.
mcp
This auxiliary slot supports MCP-related tool dispatch. MCP gives Hermes access to external tool servers, and this layer helps the agent decide how to route or interpret MCP tool usage in the background.
Again, this is usually closer to tool routing and structured interpretation than premium reasoning.
approval
This is the model used for smart approval flows. When Hermes is deciding whether a command looks risky enough to ask the user first, this classifier helps separate obviously safe actions from potentially destructive ones.
This is almost never where you want to spend premium tokens. It is a judgment and classification task, so low latency and predictable behavior matter more than top-end reasoning.
A Simpler Mental Model
If the list still feels abstract, this translation is easier to remember:
| Hermes task | Plain-English question |
|---|---|
| `compression` | “How do I shrink this conversation without losing the thread?” |
| `flush_memories` | “What from this session is worth remembering later?” |
| `web_extract` | “What does this webpage actually say?” |
| `vision` | “What is in this image or screenshot?” |
| `session_search` | “Which old session matters here?” |
| `skills_hub` | “Which skill or workflow should I use?” |
| `mcp` | “How should I use this external tool server?” |
| `approval` | “Is this action risky enough to ask first?” |
Where the Spend Actually Lives
The video’s ranking is straightforward:
| Auxiliary task | Why it costs money |
|---|---|
| `compression` | Fires whenever conversation context hits the compression threshold |
| `flush_memories` | Runs at session end to summarize and store memories |
| `web_extract` | Runs when Hermes summarizes fetched pages |
| `vision` | Runs on image analysis and browser screenshot interpretation |
Compression is the big one. That makes sense operationally: if you use Hermes for long software engineering sessions, the context window fills up over and over, and Hermes has to summarize old context so the session can keep going. A single compression pass may not look expensive, but repeated passes across a day add up fast.
The video’s demo uses roughly a 50,000-token compression pass and compares Claude Opus 4.6 with Kimi K2 through OpenRouter. The reported result is about $0.13 versus $0.02 for the same compression job, or roughly 85% savings per pass. Once the creator projects that across 10 to 20 compressions per day, the background spend stops looking trivial.
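The arithmetic is easy to sanity-check. Here is a quick sketch using the per-pass prices reported in the video (illustrative figures, not live API rates):

```python
# Reproducing the video's per-pass numbers for a ~50,000-token
# compression job (reported prices, not live API rates).
opus_per_pass = 0.13   # Claude Opus 4.6
kimi_per_pass = 0.02   # Kimi K2 via OpenRouter

savings_per_pass = 1 - kimi_per_pass / opus_per_pass
print(f"Savings per pass: {savings_per_pass:.0%}")  # ~85%

# Projected across a heavy day of 10-20 compression passes:
for passes in (10, 20):
    print(f"{passes} passes/day: "
          f"${opus_per_pass * passes:.2f} vs ${kimi_per_pass * passes:.2f}")
```

At 20 passes per day that is $2.60 versus $0.40, which is where the "background spend stops looking trivial" framing comes from.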
The Useful Mental Model: Foreground vs Background Models
The cleanest way to think about Hermes configuration is this:
- Your main model is for user-facing reasoning and tool use
- Your auxiliary models are for background support tasks
Those background tasks have different requirements. Compression needs long-context summarization. Memory flushing needs fast, structured extraction. Web extraction benefits from factual preservation. Vision obviously needs multimodal support. If you route all of them through the same premium model, you are paying for capability you often do not need.
The creator’s recommendations follow that pattern:
- For `compression`: use a model that handles long context and produces clean summaries
- For `flush_memories`: use something fast and structured
- For `web_extract`: bias toward factual preservation
- For `vision`: pick a multimodal model explicitly if needed
That is a better configuration strategy than choosing one “best” model globally.
The Config Detail That Is Easy to Miss
One of the most useful parts of the video is the warning about compression config. Hermes has a top-level `compression:` block for behavior settings like threshold and tail preservation, but the summarization model itself lives under `auxiliary.compression`.
Current Hermes docs describe it like this:
```yaml
compression:
  enabled: true
  threshold: 0.50
  target_ratio: 0.20
  protect_last_n: 20

auxiliary:
  compression:
    provider: "openrouter"
    model: "moonshotai/kimi-k2"
```

The practical rule is simple: if `auxiliary.compression` is set, that is the model Hermes uses for compression summarization. This is the part worth checking first if you thought you had changed compression behavior but your costs did not move.
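To make those settings concrete, here is a minimal Python sketch of what threshold-triggered compression plausibly does (an illustration of the documented knobs, not Hermes source): once the session crosses `threshold` of the context window, messages outside the protected tail are replaced by a summary sized roughly by `target_ratio`.

```python
# Illustrative sketch of threshold-triggered compression (not Hermes source).
def maybe_compress(messages, context_window, threshold=0.50,
                   target_ratio=0.20, protect_last_n=20):
    """messages: list of (text, token_count) pairs, oldest first."""
    total = sum(tokens for _, tokens in messages)
    if total < threshold * context_window:
        return messages  # under the threshold, nothing fires

    head, tail = messages[:-protect_last_n], messages[-protect_last_n:]
    head_tokens = sum(tokens for _, tokens in head)
    # Stand-in for the paid call to the auxiliary.compression model:
    summary = ("<summary of older context>", int(head_tokens * target_ratio))
    return [summary] + tail

# A 60k-token session in a 100k window crosses the 0.50 threshold:
history = [(f"msg {i}", 2_000) for i in range(30)]
compressed = maybe_compress(history, context_window=100_000)
print(len(compressed), sum(t for _, t in compressed))
```

Every time something like this fires on tens of thousands of tokens of history, that is one billed call to whatever model sits in the `auxiliary.compression` slot, which is why that slot dominates long sessions.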
The Universal Pattern Hermes Uses
One thing I like about the newer Hermes config is that it uses the same shape almost everywhere:
`provider`, `model`, `base_url`, `api_key`, `timeout`
That means once you understand one auxiliary slot, you understand most of them. In practice:
- Use `provider` when you want Hermes to route through a built-in provider like OpenRouter or Nous
- Use `model` when you want a specific model on that provider
- Use `base_url` when you want to bypass the provider abstraction and hit an OpenAI-compatible endpoint directly
- Use `api_key` only when that direct endpoint needs its own key
- Use `timeout` when the task is valid but your model is slower than the default budget
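As a small illustration of that shape, here is a hypothetical resolver for a single auxiliary slot (my reading of the precedence implied above, not Hermes source): an explicit `base_url` bypasses the provider, a named `model` routes through the provider, and an empty slot falls back to the default.

```python
# Hypothetical precedence for one auxiliary slot (not Hermes source):
# base_url wins, then provider + model, then the "auto" default.
def resolve_slot(slot: dict) -> str:
    if slot.get("base_url"):
        # Direct OpenAI-compatible endpoint, optionally with its own api_key
        return f"direct:{slot['base_url']}#{slot.get('model', '')}"
    if slot.get("model"):
        # Built-in provider routing (e.g. openrouter, nous)
        return f"{slot.get('provider', 'auto')}:{slot['model']}"
    return "auto:default"

print(resolve_slot({"provider": "openrouter", "model": "moonshotai/kimi-k2"}))
print(resolve_slot({"base_url": "http://localhost:1234/v1",
                    "model": "qwen2.5-14b-instruct"}))
print(resolve_slot({"provider": "auto", "model": ""}))
```

The exact internals will differ, but the point stands: the same five fields describe every slot, so one mental model covers all eight tasks.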
This matters because the real unlock is not just “pick a cheaper model.” It is “assign the right execution shape to the right job.”
A Full Auxiliary Block Is Worth Reading Once
The official docs now expose the full auxiliary surface, which is more granular than the video alone suggests:
```yaml
auxiliary:
  vision:
    provider: "auto"
    model: ""
    base_url: ""
    api_key: ""
    timeout: 30
    download_timeout: 30
  web_extract:
    provider: "auto"
    model: ""
    base_url: ""
    api_key: ""
    timeout: 360
  approval:
    provider: "auto"
    model: ""
    base_url: ""
    api_key: ""
    timeout: 30
  compression:
    provider: "auto"
    model: ""
    base_url: ""
    api_key: ""
    timeout: 120
  session_search:
    provider: "auto"
    model: ""
    base_url: ""
    api_key: ""
    timeout: 30
  skills_hub:
    provider: "auto"
    model: ""
    base_url: ""
    api_key: ""
    timeout: 30
  mcp:
    provider: "auto"
    model: ""
    base_url: ""
    api_key: ""
    timeout: 30
  flush_memories:
    provider: "auto"
    model: ""
    base_url: ""
    api_key: ""
    timeout: 30
```

That is the important architectural shift. Hermes is no longer “one model plus some hidden magic.” It is much closer to a configurable routing layer where each background operation can have its own cost, latency, and capability profile.
Example 1: Cheap-By-Default Coding Setup
If your normal workflow is heavy coding with a premium main model, this is probably the highest-value starting point:
```yaml
compression:
  enabled: true
  threshold: 0.50
  target_ratio: 0.20
  protect_last_n: 20

auxiliary:
  compression:
    provider: "openrouter"
    model: "moonshotai/kimi-k2"
    timeout: 120
  flush_memories:
    provider: "openrouter"
    model: "openai/gpt-4o-mini"
  approval:
    provider: "openrouter"
    model: "google/gemini-2.5-flash"
  session_search:
    provider: "openrouter"
    model: "google/gemini-2.5-flash"
  skills_hub:
    provider: "openrouter"
    model: "google/gemini-2.5-flash"
  mcp:
    provider: "openrouter"
    model: "google/gemini-2.5-flash"
```

This setup preserves the expensive model for your actual conversation while routing recurring support work to cheaper models. If your sessions are long, this alone can change the monthly bill materially.
Example 2: Research-Heavy Hermes Setup
If you use Hermes as a research agent, web_extract becomes much more important than in a pure coding workflow:
```yaml
auxiliary:
  compression:
    provider: "openrouter"
    model: "moonshotai/kimi-k2"
  web_extract:
    provider: "openrouter"
    model: "anthropic/claude-3.5-haiku"
    timeout: 360
  flush_memories:
    provider: "openrouter"
    model: "openai/gpt-4o-mini"
  session_search:
    provider: "openrouter"
    model: "google/gemini-2.5-flash"
```

The idea here is simple: compression still gets optimized for cost, but web summarization gets a model chosen for factual preservation and clean extraction. If the agent is reading lots of pages per day, this split makes more sense than sending everything to one heavyweight reasoning model.
Example 3: Vision-Heavy Browser Automation Setup
If Hermes spends time looking at screenshots or images, make vision explicit instead of leaving it to whatever your main model happens to be:
```yaml
auxiliary:
  vision:
    provider: "openrouter"
    model: "openai/gpt-4o"
    timeout: 45
    download_timeout: 45
  web_extract:
    provider: "openrouter"
    model: "google/gemini-2.5-flash"
  approval:
    provider: "openrouter"
    model: "google/gemini-2.5-flash"
```

That gives you a dedicated multimodal slot for image understanding while leaving text-only jobs on cheaper models.
Example 4: Local-First Auxiliary Routing
Hermes also supports direct OpenAI-compatible endpoints through base_url. That means you can push auxiliary work to a local server without changing your main cloud model.
```yaml
auxiliary:
  compression:
    base_url: "http://localhost:1234/v1"
    model: "qwen2.5-14b-instruct"
    api_key: "local-key"
    timeout: 180
  flush_memories:
    base_url: "http://localhost:1234/v1"
    model: "qwen2.5-7b-instruct"
    api_key: "local-key"
    timeout: 60
```

This is one of the most interesting parts of the current Hermes design. You can keep a strong remote frontier model for the main loop, but move repetitive summarization tasks to local infrastructure where the marginal cost is effectively zero. The tradeoff is latency and reliability, which is why the timeout values matter more in local setups.
Environment Overrides Are Useful for Quick Experiments
Hermes also supports environment overrides for at least some auxiliary tasks. For example:
```bash
AUXILIARY_VISION_MODEL=openai/gpt-4o
AUXILIARY_WEB_EXTRACT_MODEL=google/gemini-2.5-flash
```

That is handy when you want to test one task in isolation without fully rewriting `config.yaml`. It is not enough for every routing scenario, but it is a fast way to answer practical questions like: “Does GPT-4o actually improve browser screenshot interpretation enough to justify the cost?”
A More Opinionated Routing Strategy
If I were configuring Hermes today, I would not think in terms of “best model overall.” I would think in terms of four buckets:
| Bucket | Hermes tasks | What matters most |
|---|---|---|
| Long-context summarization | compression | Cheap long-context handling and clean summaries |
| Short structured extraction | flush_memories, session_search, skills_hub, mcp, approval | Low latency, predictable formatting |
| Factual page understanding | web_extract | Faithful extraction and low hallucination risk |
| Multimodal interpretation | vision | Reliable image and screenshot reasoning |
That framing makes the waste in an all-Opus or all-GPT-5 setup obvious. You are paying frontier prices for many operations that are closer to classification, summarization, or extraction.
Why This Is More Than a Cost Hack
This is not only about spending less. It is also about matching tasks to the right model shape.
Using a lighter model for approval, session_search, or flush_memories can be the right engineering choice even if money were irrelevant, because those tasks want speed and consistency more than frontier-level reasoning. Likewise, a dedicated multimodal model for vision is often more appropriate than forcing the main model to cover everything.
The other path the video highlights is local inference. If your hardware can run a capable local model for some auxiliary tasks, the marginal cost for those tasks can drop to zero. That does not mean “run everything locally.” It means background summarization and support work are often the first places where local models make economic sense.
What to Change First in a Real Hermes Setup
If you want the highest-leverage tweak, start with `auxiliary.compression`.
That one change hits the task that fires most often in long sessions and therefore gives you the largest immediate return. After that, look at:
- `web_extract` if you use Hermes heavily for research
- `vision` if you do browser automation or screenshot-heavy workflows
- `flush_memories` if you start and end many sessions per day
If you currently run a heavyweight model for everything, the practical migration path is:
- Keep the main model unchanged
- Move only `auxiliary.compression` first
- Watch cost and latency for a few days
- Then split out `web_extract` and `vision`
- Only after that, experiment with local endpoints for repeated low-risk tasks
That order matters because it gives you the biggest savings with the least configuration risk.
The broader lesson is useful beyond Hermes itself: agent cost is often dominated by support loops, not just the main completion. Once a coding agent starts compressing context, summarizing pages, routing skills, and classifying commands, its “background model architecture” matters just as much as the model you chose in the chat UI.
References
- How Hermes Agent Can Save You 85% (Or More) in BG Task Token Cost — Onchain AI Garage (April 17, 2026) — https://www.youtube.com/watch?v=NoF-YajElIM
- Configuration — Hermes Agent Docs, Nous Research — https://hermes-agent.nousresearch.com/docs/user-guide/configuration/
- Context Compression and Caching — Hermes Agent Docs, Nous Research — https://hermes-agent.nousresearch.com/docs/developer-guide/context-compression-and-caching
- Kimi K2 — OpenRouter — https://openrouter.ai/moonshotai/kimi-k2
This article was written by Codex (GPT-5.4 | OpenAI), based on content from: https://www.youtube.com/watch?v=NoF-YajElIM
