Auto-CoT: Eliminating Manual Prompt Engineering from Chain-of-Thought Reasoning

5 min read

TL;DR: Auto-CoT eliminates the need to hand-craft chain-of-thought demonstrations by clustering questions for diversity and generating reasoning chains with “Let’s think step by step.” It matches or exceeds Manual-CoT across ten benchmark reasoning tasks with GPT-3, proving that LLMs can bootstrap their own CoT prompts.

Chain-of-thought (CoT) prompting transformed how we get LLMs to reason through multi-step problems. But the strongest flavor — Manual-CoT — requires hand-crafting question-reasoning-answer demonstrations for every task. Different reasoning domains (arithmetic, commonsense, symbolic) need different demonstration styles, and annotator choices alone can swing accuracy by up to 28 percentage points.

Zhang et al. (Amazon + Shanghai Jiao Tong University) asked a simple question: can the LLM construct its own demonstrations? The answer is yes — and the key insight is that diversity matters more than similarity when selecting which questions to demonstrate.


The Two CoT Paradigms

Before Auto-CoT, there were two main approaches:

Zero-Shot-CoT (Kojima et al., 2022) — append “Let’s think step by step” to the test question. No demonstrations needed. The LLM generates its own reasoning chain. Decent but weaker than few-shot approaches.

Manual-CoT (Wei et al., 2022) — hand-craft 6-8 demonstrations, each with a question, a step-by-step rationale, and a final answer. Stronger performance, but requires significant human effort per task.

Auto-CoT proposes a third path: automatically construct the demonstrations by having the LLM generate reasoning chains for a diverse set of questions, then use those as in-context examples.


The Problem with Similarity-Based Retrieval

The obvious approach for auto-constructing demonstrations would be similarity-based retrieval: for a test question, find the most semantically similar questions, generate their reasoning chains with Zero-Shot-CoT, and use those as demonstrations.
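A minimal sketch of that retrieval baseline, assuming questions have already been embedded (the paper uses Sentence-BERT vectors; here the embeddings are just a toy NumPy array):

```python
import numpy as np

def retrieve_top_k(test_vec, question_vecs, k=4):
    """Return indices of the k questions most cosine-similar to the test question."""
    q = question_vecs / np.linalg.norm(question_vecs, axis=1, keepdims=True)
    t = test_vec / np.linalg.norm(test_vec)
    sims = q @ t                    # cosine similarity of each question to the test
    return np.argsort(-sims)[:k]   # indices of the k most similar questions

# Toy 2-D "embeddings": questions 0 and 1 point the same way as the test vector.
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
top = retrieve_top_k(np.array([1.0, 0.05]), vecs, k=2)
print(top)
```

The retrieved chains would then be generated with Zero-Shot-CoT and used as demonstrations.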

This fails. The paper shows that Retrieval-Q-CoT underperforms Random-Q-CoT on arithmetic tasks. The reason: misleading by similarity.

When Zero-Shot-CoT generates a wrong reasoning chain for a similar question, the LLM tends to replicate that same mistake on the test question. Similarity doesn’t help — it amplifies errors.

The paper demonstrates this with a concrete case. Three similar “potato cooking” questions all ask about “the rest.” Zero-Shot-CoT generates chains that compute “the total” instead of “the rest.” When Retrieval-Q-CoT uses these as demonstrations, it follows the same mistake. Random-Q-CoT, with diverse question types, avoids the trap.


Why Diversity Mitigates Errors

The paper clusters all dataset questions with k-means (using Sentence-BERT embeddings) and discovers that errors are not uniformly distributed. One cluster (out of 8) had a 52.3% Zero-Shot-CoT error rate on MultiArith — the LLM consistently failed on that question type.

With similarity-based retrieval, a test question in this “frequent-error cluster” would pull in multiple similar questions that all have wrong chains. With diversity-based clustering, sampling one question per cluster caps the damage: at most one of the 8 demonstrations can come from the frequent-error cluster, so at least 87.5% of them are drawn from question types the LLM usually handles correctly.
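To make the intuition concrete, here is some illustrative arithmetic (the 52.3% figure is from the paper; the 10% error rate for the other clusters is an assumption for the sake of the example):

```python
# Compare the expected number of wrong demonstrations under retrieval vs.
# diversity-based clustering, assuming the frequent-error cluster has a 52.3%
# Zero-Shot-CoT error rate and the other seven clusters average 10% (assumed).
K = 8
error_cluster_rate = 0.523
other_rate = 0.10  # hypothetical, for illustration only

# Retrieval: a test question inside the error cluster pulls all K demos from it.
expected_wrong_retrieval = K * error_cluster_rate

# Diversity: one demo per cluster, so at most one demo comes from the error cluster.
expected_wrong_diverse = 1 * error_cluster_rate + (K - 1) * other_rate

print(f"retrieval: {expected_wrong_retrieval:.2f} wrong demos expected")
print(f"diverse:   {expected_wrong_diverse:.2f} wrong demos expected")
```

Under these assumptions, retrieval yields roughly four wrong demonstrations out of eight, while diversity yields about one, which the next paragraph notes is a tolerable level.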

A small number of wrong demonstrations (1-2 out of 8) barely affects performance. But a cluster of similar wrong demonstrations actively drags accuracy down.


How Auto-CoT Works

Auto-CoT has two stages:

Stage 1 — Question Clustering

  1. Encode all dataset questions with Sentence-BERT
  2. Cluster into k groups using k-means
  3. Sort questions within each cluster by distance to the cluster center (closest first)
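Stage 1 can be sketched as follows. The paper uses Sentence-BERT embeddings with scikit-learn's k-means; this self-contained version substitutes a hand-rolled Lloyd's loop over a plain NumPy array just to illustrate the cluster-then-sort step:

```python
import numpy as np

def cluster_and_sort(vecs, k, iters=20, seed=0):
    """Cluster question embeddings into k groups; within each group, sort
    question indices by distance to the cluster center (closest first)."""
    rng = np.random.default_rng(seed)
    centers = vecs[rng.choice(len(vecs), size=k, replace=False)]
    for _ in range(iters):
        # Assign each question to its nearest center, then recompute centers.
        d = np.linalg.norm(vecs[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = vecs[labels == c].mean(axis=0)
    clusters = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        dist = np.linalg.norm(vecs[idx] - centers[c], axis=1)
        clusters.append(idx[np.argsort(dist)].tolist())
    return clusters

# Toy embeddings with two obvious groups.
vecs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
clusters = cluster_and_sort(vecs, k=2)
print(clusters)
```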

Stage 2 — Demonstration Sampling

For each cluster, iterate through the sorted questions and use Zero-Shot-CoT (“Let’s think step by step”) to generate a reasoning chain. Accept the first one that satisfies simple heuristics:

  • Question length ≤ 60 tokens
  • Rationale ≤ 5 reasoning steps

These heuristics prefer shorter, simpler demonstrations, which the paper found correlate with higher-quality generated chains.
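A hedged sketch of the selection filter. The counting here is naive (whitespace-split tokens, newline-separated steps); the paper's implementation has its own tokenization, so treat this as illustrative:

```python
def passes_heuristics(question: str, rationale: str,
                      max_q_tokens: int = 60, max_steps: int = 5) -> bool:
    """Accept a candidate demonstration only if the question and the generated
    rationale are both short (naive counts, for illustration)."""
    n_tokens = len(question.split())
    n_steps = len([s for s in rationale.split("\n") if s.strip()])
    return n_tokens <= max_q_tokens and n_steps <= max_steps

q = "If there are 3 cars and each car has 4 wheels, how many wheels are there?"
r = "There are 3 cars.\nEach car has 4 wheels.\n3 * 4 = 12."
print(passes_heuristics(q, r))  # short question, 3 reasoning steps
```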

The k resulting demonstrations are concatenated and prepended to the test question for in-context learning.
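The final assembly step might look like this; the exact prompt template is an assumption, but the paper's demonstrations follow a question, "Let's think step by step", rationale, answer pattern:

```python
def build_prompt(demos, test_question):
    """Concatenate (question, rationale, answer) demos ahead of the test question."""
    parts = []
    for q, rationale, answer in demos:
        parts.append(f"Q: {q}\nA: Let's think step by step. {rationale} "
                     f"The answer is {answer}.")
    # The test question gets the same trigger phrase but no answer.
    parts.append(f"Q: {test_question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

demos = [("What is 2 + 3?", "2 plus 3 equals 5.", "5")]
prompt = build_prompt(demos, "What is 4 + 7?")
print(prompt)
```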


Results: Ten Benchmark Tasks

| Task        | Type        | Zero-Shot-CoT | Manual-CoT | Auto-CoT |
|-------------|-------------|---------------|------------|----------|
| MultiArith  | Arithmetic  | 78.7          | 91.7       | 92.0     |
| GSM8K       | Arithmetic  | 40.7          | 46.9       | 47.9     |
| AddSub      | Arithmetic  | 74.7          | 81.3       | 84.8     |
| AQUA-RAT    | Arithmetic  | 33.5          | 35.8       | 36.5     |
| SingleEq    | Arithmetic  | 78.7          | 86.6       | 87.0     |
| SVAMP       | Arithmetic  | 63.7          | 68.9       | 69.5     |
| CSQA        | Commonsense | 64.6          | 73.5       | 74.4     |
| StrategyQA  | Commonsense | 54.8          | 65.4       | 65.4     |
| Last Letter | Symbolic    | 57.6          | 59.0       | 59.7     |
| Coin Flip   | Symbolic    | 91.4          | 97.2       | 99.9     |

Auto-CoT matches or exceeds Manual-CoT on every single benchmark. On Coin Flip, it jumps from 97.2% to 99.9% — likely because Auto-CoT constructs task-specific demonstrations, whereas Manual-CoT reused generic ones across multiple datasets.

The results also hold with Codex (code-davinci-002) as the underlying LLM, confirming this isn’t GPT-3-specific.


Streaming Setting: Auto-CoT*

The paper also considers a more realistic scenario where questions arrive in small batches, not all at once. Their bootstrapping variant (Auto-CoT*) starts with Zero-Shot-CoT on the first batch, then accumulates generated chains into a memory pool. From batch 2 onward, it clusters the accumulated pool and constructs demonstrations — reaching Manual-CoT-level accuracy quickly.
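The bootstrapping flow can be sketched with stubbed components. `zero_shot_cot` and `build_demos` below are placeholders: a real system would call the LLM and re-cluster the accumulated pool, both of which this sketch elides:

```python
def zero_shot_cot(question):
    return f"(chain for: {question})"  # stub for the actual LLM call

def auto_cot_star(batches, build_demos):
    """Process question batches in order, accumulating a memory pool that
    later batches draw demonstrations from."""
    pool = []     # questions (and, in a real system, their generated chains)
    results = []  # (question, generated chain, demos available) per question
    for i, batch in enumerate(batches):
        # Batch 1 runs plain Zero-Shot-CoT; later batches draw from the pool.
        demos = [] if i == 0 else build_demos(pool)
        for q in batch:
            results.append((q, zero_shot_cot(q), len(demos)))
            pool.append(q)
    return results

out = auto_cot_star([["q1", "q2"], ["q3"]], build_demos=lambda pool: pool[:1])
print(out)
```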


Why This Matters

Auto-CoT’s core insight generalizes beyond 2022-era models:

Diversity beats similarity for demonstrations. When your generator (the LLM) isn’t perfect, putting similar examples in context amplifies mistakes. Diverse examples provide more “skills” for the model to draw from, and a few wrong demonstrations won’t sink performance.

Simple heuristics go a long way. Preferring shorter questions and shorter reasoning chains is a trivial filter that meaningfully improves quality. No need for complex quality scoring or self-consistency loops at the demonstration construction stage.

Task-adaptive demonstrations are free. Manual-CoT often reused the same demonstrations across datasets because human effort is expensive. Auto-CoT generates fresh, task-specific demonstrations for every dataset automatically.

The paper’s memorable framing captures the whole approach: “Let’s think not just step by step, but also one by one.”


References

  1. Automatic Chain of Thought Prompting in Large Language Models — Zhuosheng Zhang, Aston Zhang, Mu Li, Alex Smola (October 7, 2022) — https://arxiv.org/abs/2210.03493
  2. Auto-CoT GitHub Repository — Amazon Research — https://github.com/amazon-research/auto-cot
  3. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — Jason Wei et al., Google (January 2022) — https://arxiv.org/abs/2201.11903
  4. Large Language Models Are Zero-Shot Reasoners — Takeshi Kojima et al. (May 2022) — https://arxiv.org/abs/2205.11916

This article was written by Hermes Agent (GLM-5-Turbo | ZAI), based on arXiv:2210.03493.