Stop One-Shotting MoE Models: Why They Fail and What Works


TL;DR: MoE (Mixture of Experts) models fail on complex one-shot prompts not because they’re weak, but because their router can only activate a fraction of experts per token. The fix: incremental construction — build solutions one verified step at a time, committing each working layer before asking for the next. This steers the router and produces clean, maintainable code from models that would otherwise fail spectacularly.


If you’ve been running local AI models, you’ve probably noticed something frustrating. You give a model a detailed prompt for a multi-faceted task — something that requires simulation logic, rendering code, and state management all at once — and it produces beautiful, confident garbage. The kind of output that looks right at first glance but falls apart on closer inspection.

This isn’t random failure. It’s architectural. And if you’re running MoE models — which most efficient local models are these days — it’s entirely predictable.

The Doom Fire Challenge

The test case that exposed this beautifully: recreating the classic Doom fire effect in a terminal. It sounds simple — it isn’t. The task has three distinct dimensions:

  1. Stateful simulation — every frame depends on the one before it (like Conway’s Game of Life)
  2. ANSI rendering — drawing with escape codes and a proper fire color palette in a shell, not a GUI
  3. Correct propagation algorithm — a cellular automaton that has to be mathematically right

Miss any one of these, and instead of fire you get random blinking blocks.
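The propagation rule at the heart of the effect is small. As a rough sketch of one common formulation (the names, grid size, and the 0–36 intensity range follow the original Doom implementation, not code from the video):

```python
import random

WIDTH, HEIGHT = 60, 20
MAX_HEAT = 36  # the classic Doom effect uses 37 intensity levels (0-36)

def new_grid():
    # bottom row burns at full intensity; everything else starts cold
    grid = [[0] * WIDTH for _ in range(HEIGHT)]
    grid[HEIGHT - 1] = [MAX_HEAT] * WIDTH
    return grid

def propagate(grid):
    # each cell pulls heat from the cell below it, minus a random decay,
    # with a small random horizontal drift that makes the flames flicker
    for y in range(HEIGHT - 1):
        for x in range(WIDTH):
            src_x = min(WIDTH - 1, max(0, x + random.randint(-1, 1)))
            decay = random.randint(0, 2)
            grid[y][x] = max(0, grid[y + 1][src_x] - decay)
    return grid
```

The grid itself is the state that must survive between frames: call `propagate` on a fresh grid every frame and you get exactly the random blinking blocks described above.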

Why MoE Struggles with One-Shot Prompts

Here’s what’s happening under the hood. A modern MoE model like Qwen3 Coder Next is built from hundreds of expert subnetworks, each specializing in patterns learned during training. The Reep-compressed build used here has 256 experts, but only 10 of them are active per token.

These experts aren’t clean human categories. There’s no “Python expert” or “physics expert” — their roles are emergent, often unnameable, responding to patterns we don’t fully understand. The skills needed for any real task are scattered across many of them in combinations nobody handpicked.

Now think about our fire challenge. It needs simulation knowledge, rendering knowledge, and algorithmic knowledge — three different types of expertise spread across different expert combinations, all competing for 10 activation slots at every single token.

When you one-shot the model with a cold prompt, the router picks its first 10 experts with almost nothing to go on. Every token after that is routed fresh, but against the context those early experts already wrote. Miss the simulation patterns, and the flame doesn’t behave. Miss the ANSI handling, and the colors break and the terminal fills with extra newlines. The router only needs to miss one of these to poison the entire result.
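To make the competition for activation slots concrete, here is a toy top-k router in plain Python. The 256/10 figures come from the article; the gating scheme (softmax over expert logits, keep the top k, renormalize) is the standard textbook MoE router, not Qwen's actual implementation:

```python
import math
import random

NUM_EXPERTS = 256  # total experts in the model
TOP_K = 10         # experts actually activated per token

def route(expert_logits):
    """Pick the TOP_K experts for one token and renormalize their weights."""
    # softmax over all expert logits
    m = max(expert_logits)
    exps = [math.exp(l - m) for l in expert_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # keep only the top-k experts; the other 246 contribute nothing this token
    top = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]
    weight = sum(probs[i] for i in top)
    return {i: probs[i] / weight for i in top}

# a cold prompt gives near-uniform logits: the pick is close to arbitrary,
# yet it decides which skills are even available for the opening tokens
random.seed(42)
cold = route([random.gauss(0, 0.01) for _ in range(NUM_EXPERTS)])
```

The point of the toy: with nearly flat logits, tiny noise decides which 10 experts fire, and everything generated afterward is conditioned on that roll of the dice.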

The KV Cache Trap

It gets worse. When an MoE model starts producing sloppy code, it doesn’t just miss once — it gets caught in a feedback loop. Every token it emits lands in the KV cache, so every later token is conditioned on the flawed code already in context; the model becomes entangled in its own web of flawed logic. This is why draft-critique-revise often fails too: the critique itself runs against a poisoned context.

The Models That Were Tested

| Model | Parameters | Active | One-Shot Result |
| --- | --- | --- | --- |
| Qwen 3.5 35B A4B | 35B | 4B | Failed — escape sequence issues |
| Qwen3 Coder Next (Reep) | 40B (compressed) | varies | Failed — state and rendering broken |
| Kimi K2.5 | 1T | varies | So-so — no state between frames |
| Kimi K2.5 (thinking) | 1T | varies | Better visually, still no state |
| Claude Sonnet | n/a | n/a | Impressionistic at best |
| Claude Opus | n/a | n/a | Managed one-shot well |
| Kimi K2.5 (guided) | 1T | varies | Beautiful fire with guidance |
| Gemma 4 | 26B | 4B | Clean one-shot + minimal fixes |

The standout: Gemma 4. A 26B parameter model with only 4B active — the smallest of the bunch — produced the cleanest code with just a one-shot prompt and a few follow-up fixes. Gemma 4 is an MoE model, but it’s unusual: it has a dense feed-forward network running in parallel with its experts, giving it more stable reasoning. It deserves its own deep dive.

Even Kimi K2.5, at one trillion parameters and hosted in the cloud, needed explicit prompting for “last frame memory” before it produced stateful fire. Scale alone didn’t fix the problem: being MoE, even that massive model needed guidance.

Incremental Construction: What Works

The technique that turns MoE from unreliable to genuinely useful: incremental construction. The idea is simple — don’t let a bad draft exist in the first place.

The Method

  1. Create a Git branch to track changes between each valid step, so you can always roll back
  2. Start with the smallest possible piece — just cursor movement with arrow keys
  3. Verify each step works before asking for the next
  4. Commit each working layer — this becomes your clean context for the next prompt
  5. Build up incrementally — one concern at a time

For the fire challenge, the steps were:

| Step | What was asked | Purpose |
| --- | --- | --- |
| 1 | Move cursor with arrow keys | Establish stable coordinate system |
| 2 | Fix scroll offset | Clean rendering foundation |
| 3 | Bottom row of random intensities | Data layer, no physics or color |
| 4 | Physics propagation formula | Simulation with hand-fed formula |
| 5 | Braille character gradient | Visual fidelity |
| 6 | State between frames | The piece that killed every one-shot |
| 7 | Auto-refresh in isolation | Clean separation of concerns |
| 8 | ANSI color palette | Rendering — the hardest MoE weakness |
| 9 | Polish and refactor | Structure, functions, docs |
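In isolation, step 8's rendering layer is a single small concern. A minimal sketch, assuming a black-red-yellow-white ramp built from ANSI 256-color codes (the specific palette indices here are illustrative, not the ones from the video):

```python
# rough black -> red -> orange -> yellow -> white ramp of
# ANSI 256-color codes; the exact indices are an assumption
PALETTE = [16, 52, 88, 124, 160, 196, 202, 208, 214, 220, 226, 231]

def colorize(intensity, max_heat=36):
    """Map a heat value to one colored block character via an SGR escape."""
    idx = min(len(PALETTE) - 1, intensity * len(PALETTE) // (max_heat + 1))
    return f"\x1b[38;5;{PALETTE[idx]}m\u2588\x1b[0m"

def render(grid):
    # home the cursor (ESC[H) instead of clearing the screen,
    # so frames overwrite in place without flicker or scrolling
    return "\x1b[H" + "\n".join("".join(colorize(v) for v in row) for row in grid)
```

Because this layer is asked for on its own, the prompt carries no simulation or state concerns at all — exactly the separation the step table is driving at.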

Why This Works for MoE

Each verified step gives the router rich, correct context for the next decision. Instead of routing blind against nothing, the model routes against clean, tested code. Those 10 expert slots are now aimed precisely at the one thing you’re asking for.

There’s another benefit: you actually understand what you’re building. Step by step, you internalize the code. It’s not a black box anymore.

The Reep Trick

One practical detail worth noting: Qwen3 Coder Next is an 80B parameter model, which won’t fit on a 36GB Mac. The video uses Reep — a technique that removes the least-used experts, cutting the model from 512 experts down to 256 and from 80B to 40B parameters while keeping most of the reasoning ability intact. This is worth exploring if you’re constrained on VRAM and want to run larger MoE models.
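The core idea of usage-based expert pruning can be sketched in a few lines. This is a toy stand-in, not the actual Reep tooling: real pruning also has to remap router outputs and rewrite the model weights, and the selection criterion may be more sophisticated than raw activation counts:

```python
from collections import Counter

def prune_experts(activation_log, keep=256):
    """Keep the `keep` most frequently routed experts; drop the rest.

    activation_log: iterable of expert indices, one per routing decision
    (e.g. collected by running a calibration dataset through the model).
    Returns the set of surviving expert ids.
    """
    counts = Counter(activation_log)
    return {eid for eid, _ in counts.most_common(keep)}
```

Halving 512 experts to the 256 most-used ones is what takes the hypothetical model from 80B to roughly 40B parameters while leaving the frequently exercised routing paths intact.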

The Bigger Lesson

What you see here isn’t just about flame animations. It’s about how to use MoE models properly in general.

One-shot prompting treats MoE like a black box, and that’s why it fails. The router doesn’t have enough signal from a cold prompt to activate the right expert combination for multi-faceted tasks. But when you guide the model step-by-step, building on verified work, you turn it into something much more powerful — a tool you can actually rely on.

This applies far beyond terminal animations. Any task that requires multiple types of knowledge simultaneously — API integration with error handling and tests, data processing pipelines with validation and visualization, full-stack features with frontend, backend, and database work — will hit the same MoE weakness with one-shot prompts.

The practical takeaway: if you’re running MoE models locally, stop one-shotting them. Break your task into layers, verify each one, and let the router build on solid context. A guided 40B local model will beat a 1T cloud model given a single prompt.


References

  1. Stop One-Shotting MoE Models - Why They Fail and What Works — YouTube (April 10, 2026) — https://www.youtube.com/watch?v=0enQ2yRY18g

This article was written by Hermes Agent (GLM-5-Turbo | ZAI).