TL;DR: Hermes Agent does more than call your main chat model. It also runs eight background auxiliary tasks, and if you leave them on expensive defaults, context compression alone can quietly dominate your token bill. The practical fix is to route high-frequency background work, especially compression, to a cheaper or local model while keeping the main model for user-facing reasoning.
Most people think about Hermes pricing in terms of the main chat model they selected: Claude Opus, Sonnet, GPT-5, or something local. The video from Onchain AI Garage makes a more useful point: a large share of cost can come from the models you are not thinking about at all.
Hermes has a separate auxiliary layer for background work such as compression, memory flushing, web extraction, and vision. Those tasks are small individually, but they can fire often enough that they matter more than one expensive foreground call. If you are doing long coding sessions, research-heavy browsing, or image-driven workflows, this is where optimization starts paying off.
The Eight Auxiliary Tasks Matter More Than They Look
The video frames Hermes as running eight “hidden” model slots in the background. The ones called out explicitly are:
`compression`, `flush_memories`, `web_extract`, `vision`, `session_search`, `skills_hub`, `mcp`, and `approval`.
Not all of these cost the same. The important point is not just that they exist, but that they can be configured independently from the main model. That changes the optimization problem completely. You do not need your most expensive frontier model deciding whether a command looks risky or summarizing an internal session search result.
What Each Auxiliary Task Actually Does
The names are useful once you already know Hermes internals, but they are not very friendly if you are seeing them for the first time. Here is the practical interpretation of each one.
compression
This is Hermes’ context summarizer. When a conversation gets long enough to hit the configured compression threshold, Hermes summarizes older parts of the session so it can keep working without overflowing the context window.
This is usually the biggest cost driver because it can fire repeatedly during long coding or research sessions. It does not usually need your best reasoning model. It needs a model that can handle long input cleanly and produce reliable summaries.
flush_memories
This runs when Hermes ends a session and decides what should be written into long-term memory. In other words, it looks back at the conversation and extracts durable facts, preferences, and useful outcomes.
This is more like structured extraction than deep reasoning. A fast, cheaper model is often enough as long as it is consistent.
web_extract
This is the model Hermes uses to turn fetched web pages into usable summaries or extracted text. If you use Hermes for research, documentation reading, or browsing-heavy workflows, this slot matters a lot.
This task benefits from factual preservation more than raw intelligence. A model that summarizes faithfully is usually better here than a very expensive model that is overkill for extraction.
vision
This handles image analysis: screenshots, browser screenshots, and image inputs. If Hermes is using visual tools, this is the model that needs to understand what is on the screen.
This one is different from the others because it must be multimodal. A text-only model cannot do this job. If you automate websites or inspect screenshots often, making this slot explicit is a good idea.
session_search
This is the background model used when Hermes looks through past sessions and summarizes the results. It helps answer questions like: “Have we worked on this before?” or “Which previous session is relevant?”
This is typically a summarization-and-ranking task, not a hard reasoning task. Speed is usually more important than frontier-level depth.
skills_hub
This is Hermes’ skill matcher. When Hermes sees your request, skills_hub helps determine whether one of the installed skills is relevant and which skill instructions should be loaded.
The easiest way to think about it is: skills_hub answers, “Which playbook should I use?” That means it is mostly a routing problem. In most setups, it does not need an expensive heavyweight model.
mcp
This auxiliary slot supports MCP-related tool dispatch. MCP gives Hermes access to external tool servers, and this layer helps the agent decide how to route or interpret MCP tool usage in the background.
Again, this is usually closer to tool routing and structured interpretation than premium reasoning.
approval
This is the model used for smart approval flows. When Hermes is deciding whether a command looks risky enough to ask the user first, this classifier helps separate obviously safe actions from potentially destructive ones.
This is almost never where you want to spend premium tokens. It is a judgment and classification task, so low latency and predictable behavior matter more than top-end reasoning.
A Simpler Mental Model
If the list still feels abstract, this translation is easier to remember:
| Hermes task | Plain-English question |
|---|---|
| `compression` | “How do I shrink this conversation without losing the thread?” |
| `flush_memories` | “What from this session is worth remembering later?” |
| `web_extract` | “What does this webpage actually say?” |
| `vision` | “What is in this image or screenshot?” |
| `session_search` | “Which old session matters here?” |
| `skills_hub` | “Which skill or workflow should I use?” |
| `mcp` | “How should I use this external tool server?” |
| `approval` | “Is this action risky enough to ask first?” |
Where the Spend Actually Lives
The video’s ranking is straightforward:
| Auxiliary task | Why it costs money |
|---|---|
| `compression` | Fires whenever conversation context hits the compression threshold |
| `flush_memories` | Runs at session end to summarize and store memories |
| `web_extract` | Runs when Hermes summarizes fetched pages |
| `vision` | Runs on image analysis and browser screenshot interpretation |
Compression is the big one. That makes sense operationally: if you use Hermes for long software engineering sessions, the context window fills up over and over, and Hermes has to summarize old context so the session can keep going. A single compression pass may not look expensive, but repeated passes across a day add up fast.
The video’s demo uses roughly a 50,000-token compression pass and compares Claude Opus 4.6 with Kimi K2 through OpenRouter. The reported result is about $0.13 versus $0.02 for the same compression job, or roughly 85% savings per pass. Once the creator projects that across 10 to 20 compressions per day, the background spend stops looking trivial.
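The arithmetic is easy to sanity-check. Here is a quick sketch using the per-pass prices reported in the video (illustrative figures, not live API rates):

```python
# Reproducing the video's per-pass numbers for a ~50,000-token
# compression job (reported prices, not live API rates).
opus_per_pass = 0.13   # Claude Opus 4.6
kimi_per_pass = 0.02   # Kimi K2 via OpenRouter

savings_per_pass = 1 - kimi_per_pass / opus_per_pass
print(f"Savings per pass: {savings_per_pass:.0%}")  # ~85%

# Projected across a heavy day of 10-20 compression passes:
for passes in (10, 20):
    print(f"{passes} passes/day: "
          f"${opus_per_pass * passes:.2f} vs ${kimi_per_pass * passes:.2f}")
```

At 20 passes per day that is $2.60 versus $0.40, which is where the "background spend stops looking trivial" framing comes from.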
The Useful Mental Model: Foreground vs Background Models
The cleanest way to think about Hermes configuration is this:
- Your main model is for user-facing reasoning and tool use
- Your auxiliary models are for background support tasks
Those background tasks have different requirements. Compression needs long-context summarization. Memory flushing needs fast, structured extraction. Web extraction benefits from factual preservation. Vision obviously needs multimodal support. If you route all of them through the same premium model, you are paying for capability you often do not need.
The creator’s recommendations follow that pattern:
- For `compression`: use a model that handles long context and produces clean summaries
- For `flush_memories`: use something fast and structured
- For `web_extract`: bias toward factual preservation
- For `vision`: pick a multimodal model explicitly if needed
That is a better configuration strategy than choosing one “best” model globally.
The Config Detail That Is Easy to Miss
One of the most useful parts of the video is the warning about compression config. Hermes has a top-level `compression:` block for behavior settings like threshold and tail preservation, but the summarization model itself lives under `auxiliary.compression`.
Current Hermes docs describe it like this:
```yaml
compression:
  enabled: true
  threshold: 0.50
  target_ratio: 0.20
  protect_last_n: 20

auxiliary:
  compression:
    provider: "openrouter"
    model: "moonshotai/kimi-k2"
```

The practical rule is simple: if `auxiliary.compression` is set, that is the model Hermes uses for compression summarization. This is the part worth checking first if you thought you had changed compression behavior but your costs did not move.
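To make those settings concrete, here is a minimal Python sketch of what threshold-triggered compression plausibly does (an illustration of the documented knobs, not Hermes source): once the session crosses `threshold` of the context window, messages outside the protected tail are replaced by a summary sized roughly by `target_ratio`.

```python
# Illustrative sketch of threshold-triggered compression (not Hermes source).
def maybe_compress(messages, context_window, threshold=0.50,
                   target_ratio=0.20, protect_last_n=20):
    """messages: list of (text, token_count) pairs, oldest first."""
    total = sum(tokens for _, tokens in messages)
    if total < threshold * context_window:
        return messages  # under the threshold, nothing fires

    head, tail = messages[:-protect_last_n], messages[-protect_last_n:]
    head_tokens = sum(tokens for _, tokens in head)
    # Stand-in for the paid call to the auxiliary.compression model:
    summary = ("<summary of older context>", int(head_tokens * target_ratio))
    return [summary] + tail

# A 60k-token session in a 100k window crosses the 0.50 threshold:
history = [(f"msg {i}", 2_000) for i in range(30)]
compressed = maybe_compress(history, context_window=100_000)
print(len(compressed), sum(t for _, t in compressed))
```

Every time something like this fires on tens of thousands of tokens of history, that is one billed call to whatever model sits in the `auxiliary.compression` slot, which is why that slot dominates long sessions.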
The Universal Pattern Hermes Uses
One thing I like about the newer Hermes config is that it uses the same shape almost everywhere:
`provider`, `model`, `base_url`, `api_key`, `timeout`
That means once you understand one auxiliary slot, you understand most of them. In practice:
- Use `provider` when you want Hermes to route through a built-in provider like OpenRouter or Nous
- Use `model` when you want a specific model on that provider
- Use `base_url` when you want to bypass the provider abstraction and hit an OpenAI-compatible endpoint directly
- Use `api_key` only when that direct endpoint needs its own key
- Use `timeout` when the task is valid but your model is slower than the default budget
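As a small illustration of that shape, here is a hypothetical resolver for a single auxiliary slot (my reading of the precedence implied above, not Hermes source): an explicit `base_url` bypasses the provider, a named `model` routes through the provider, and an empty slot falls back to the default.

```python
# Hypothetical precedence for one auxiliary slot (not Hermes source):
# base_url wins, then provider + model, then the "auto" default.
def resolve_slot(slot: dict) -> str:
    if slot.get("base_url"):
        # Direct OpenAI-compatible endpoint, optionally with its own api_key
        return f"direct:{slot['base_url']}#{slot.get('model', '')}"
    if slot.get("model"):
        # Built-in provider routing (e.g. openrouter, nous)
        return f"{slot.get('provider', 'auto')}:{slot['model']}"
    return "auto:default"

print(resolve_slot({"provider": "openrouter", "model": "moonshotai/kimi-k2"}))
print(resolve_slot({"base_url": "http://localhost:1234/v1",
                    "model": "qwen2.5-14b-instruct"}))
print(resolve_slot({"provider": "auto", "model": ""}))
```

The exact internals will differ, but the point stands: the same five fields describe every slot, so one mental model covers all eight tasks.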
This matters because the real unlock is not just “pick a cheaper model.” It is “assign the right execution shape to the right job.”
A Full Auxiliary Block Is Worth Reading Once
The official docs now expose the full auxiliary surface, which is more granular than the video alone suggests:
```yaml
auxiliary:
  vision:
    provider: "auto"
    model: ""
    base_url: ""
    api_key: ""
    timeout: 30
    download_timeout: 30
  web_extract:
    provider: "auto"
    model: ""
    base_url: ""
    api_key: ""
    timeout: 360
  approval:
    provider: "auto"
    model: ""
    base_url: ""
    api_key: ""
    timeout: 30
  compression:
    provider: "auto"
    model: ""
    base_url: ""
    api_key: ""
    timeout: 120
  session_search:
    provider: "auto"
    model: ""
    base_url: ""
    api_key: ""
    timeout: 30
  skills_hub:
    provider: "auto"
    model: ""
    base_url: ""
    api_key: ""
    timeout: 30
  mcp:
    provider: "auto"
    model: ""
    base_url: ""
    api_key: ""
    timeout: 30
  flush_memories:
    provider: "auto"
    model: ""
    base_url: ""
    api_key: ""
    timeout: 30
```

That is the important architectural shift. Hermes is no longer “one model plus some hidden magic.” It is much closer to a configurable routing layer where each background operation can have its own cost, latency, and capability profile.
Example 1: Cheap-By-Default Coding Setup
If your normal workflow is heavy coding with a premium main model, this is probably the highest-value starting point:
```yaml
compression:
  enabled: true
  threshold: 0.50
  target_ratio: 0.20
  protect_last_n: 20

auxiliary:
  compression:
    provider: "openrouter"
    model: "moonshotai/kimi-k2"
    timeout: 120
  flush_memories:
    provider: "openrouter"
    model: "openai/gpt-4o-mini"
  approval:
    provider: "openrouter"
    model: "google/gemini-2.5-flash"
  session_search:
    provider: "openrouter"
    model: "google/gemini-2.5-flash"
  skills_hub:
    provider: "openrouter"
    model: "google/gemini-2.5-flash"
  mcp:
    provider: "openrouter"
    model: "google/gemini-2.5-flash"
```

This setup preserves the expensive model for your actual conversation while routing recurring support work to cheaper models. If your sessions are long, this alone can change the monthly bill materially.
Example 2: Research-Heavy Hermes Setup
If you use Hermes as a research agent, web_extract becomes much more important than in a pure coding workflow:
```yaml
auxiliary:
  compression:
    provider: "openrouter"
    model: "moonshotai/kimi-k2"
  web_extract:
    provider: "openrouter"
    model: "anthropic/claude-3.5-haiku"
    timeout: 360
  flush_memories:
    provider: "openrouter"
    model: "openai/gpt-4o-mini"
  session_search:
    provider: "openrouter"
    model: "google/gemini-2.5-flash"
```

The idea here is simple: compression still gets optimized for cost, but web summarization gets a model chosen for factual preservation and clean extraction. If the agent is reading lots of pages per day, this split makes more sense than sending everything to one heavyweight reasoning model.
Example 3: Vision-Heavy Browser Automation Setup
If Hermes spends time looking at screenshots or images, make vision explicit instead of leaving it to whatever your main model happens to be:
```yaml
auxiliary:
  vision:
    provider: "openrouter"
    model: "openai/gpt-4o"
    timeout: 45
    download_timeout: 45
  web_extract:
    provider: "openrouter"
    model: "google/gemini-2.5-flash"
  approval:
    provider: "openrouter"
    model: "google/gemini-2.5-flash"
```

That gives you a dedicated multimodal slot for image understanding while leaving text-only jobs on cheaper models.
Example 4: Local-First Auxiliary Routing
Hermes also supports direct OpenAI-compatible endpoints through base_url. That means you can push auxiliary work to a local server without changing your main cloud model.
```yaml
auxiliary:
  compression:
    base_url: "http://localhost:1234/v1"
    model: "qwen2.5-14b-instruct"
    api_key: "local-key"
    timeout: 180
  flush_memories:
    base_url: "http://localhost:1234/v1"
    model: "qwen2.5-7b-instruct"
    api_key: "local-key"
    timeout: 60
```

This is one of the most interesting parts of the current Hermes design. You can keep a strong remote frontier model for the main loop, but move repetitive summarization tasks to local infrastructure where the marginal cost is effectively zero. The tradeoff is latency and reliability, which is why the timeout values matter more in local setups.
Environment Overrides Are Useful for Quick Experiments
Hermes also supports environment overrides for at least some auxiliary tasks. For example:
```bash
AUXILIARY_VISION_MODEL=openai/gpt-4o
AUXILIARY_WEB_EXTRACT_MODEL=google/gemini-2.5-flash
```

That is handy when you want to test one task in isolation without fully rewriting `config.yaml`. It is not enough for every routing scenario, but it is a fast way to answer practical questions like: “Does GPT-4o actually improve browser screenshot interpretation enough to justify the cost?”
A More Opinionated Routing Strategy
If I were configuring Hermes today, I would not think in terms of “best model overall.” I would think in terms of four buckets:
| Bucket | Hermes tasks | What matters most |
|---|---|---|
| Long-context summarization | compression | Cheap long-context handling and clean summaries |
| Short structured extraction | flush_memories, session_search, skills_hub, mcp, approval | Low latency, predictable formatting |
| Factual page understanding | web_extract | Faithful extraction and low hallucination risk |
| Multimodal interpretation | vision | Reliable image and screenshot reasoning |
That framing makes the waste in an all-Opus or all-GPT-5 setup obvious. You are paying frontier prices for many operations that are closer to classification, summarization, or extraction.
Why This Is More Than a Cost Hack
This is not only about spending less. It is also about matching tasks to the right model shape.
Using a lighter model for approval, session_search, or flush_memories can be the right engineering choice even if money were irrelevant, because those tasks want speed and consistency more than frontier-level reasoning. Likewise, a dedicated multimodal model for vision is often more appropriate than forcing the main model to cover everything.
The other path the video highlights is local inference. If your hardware can run a capable local model for some auxiliary tasks, the marginal cost for those tasks can drop to zero. That does not mean “run everything locally.” It means background summarization and support work are often the first places where local models make economic sense.
What to Change First in a Real Hermes Setup
If you want the highest-leverage tweak, start with `auxiliary.compression`.
That one change hits the task that fires most often in long sessions and therefore gives you the largest immediate return. After that, look at:
- `web_extract` if you use Hermes heavily for research
- `vision` if you do browser automation or screenshot-heavy workflows
- `flush_memories` if you start and end many sessions per day
If you currently run a heavyweight model for everything, the practical migration path is:
- Keep the main model unchanged
- Move only `auxiliary.compression` first
- Watch cost and latency for a few days
- Then split out `web_extract` and `vision`
- Only after that, experiment with local endpoints for repeated low-risk tasks
That order matters because it gives you the biggest savings with the least configuration risk.
The broader lesson is useful beyond Hermes itself: agent cost is often dominated by support loops, not just the main completion. Once a coding agent starts compressing context, summarizing pages, routing skills, and classifying commands, its “background model architecture” matters just as much as the model you chose in the chat UI.
References
- How Hermes Agent Can Save You 85% (Or More) in BG Task Token Cost — Onchain AI Garage (April 17, 2026) — https://www.youtube.com/watch?v=NoF-YajElIM
- Configuration — Hermes Agent Docs, Nous Research — https://hermes-agent.nousresearch.com/docs/user-guide/configuration/
- Context Compression and Caching — Hermes Agent Docs, Nous Research — https://hermes-agent.nousresearch.com/docs/developer-guide/context-compression-and-caching
- Kimi K2 — OpenRouter — https://openrouter.ai/moonshotai/kimi-k2
This article was written by Codex (GPT-5.4 | OpenAI), based on content from: https://www.youtube.com/watch?v=NoF-YajElIM
