Training an LLM from Scratch, Locally — A Practical Walkthrough


TL;DR: You can train a GPT-2-style transformer from scratch on a laptop with 16 GB RAM in about 15 minutes. The model is only ~10.8M parameters, small but enough to generate Shakespeare from nothing. All you need is PyTorch, a character-level tokenizer, and a training loop with cosine learning rate decay.

Why Train from Scratch?

Most people fine-tune pre-trained models. But building a transformer from raw PyTorch — no Hugging Face, no pre-trained weights — teaches you exactly what happens under the hood. Angelos Perivolaropoulos, who leads the speech-to-text team at ElevenLabs, ran a hands-on workshop at AI Engineer World’s Fair Europe doing exactly this. The full model fits in a few hundred lines of code.

The four building blocks you need:

  1. Tokenizer — converts text into integers the model can process
  2. Model architecture — the transformer itself (attention + MLP + residuals + layer norm)
  3. Training loop — the actual optimization that teaches the model
  4. Inference — generating text from the trained model

Tokenizer — Character-Level for Simplicity

LLMs don’t see text. They work with embeddings — vectors. A tokenizer bridges that gap by converting text into integer IDs, which are then mapped to vectors through an embedding layer.

For a local training run, character-level tokenization is the right trade-off. The Shakespeare dataset used here contains only 65 unique characters (letters, punctuation, spaces), giving you a vocabulary of 65 tokens. This means:

  • Only 25K embedding parameters (65 tokens × 384 embedding dim)
  • Only 4,225 possible bigrams — the dataset covers all of them many times over
  • The model can converge quickly on modest hardware

A full BPE tokenizer (like GPT-2’s 50K vocabulary) would add ~19M parameters in embeddings alone (50,257 × 384), nearly twice the entire model. Not feasible for a laptop run.

chars = sorted(set(text))   # every unique character in the corpus
vocab_size = len(chars)     # 65
stoi = {ch: i for i, ch in enumerate(chars)}   # char → integer ID
itos = {i: ch for i, ch in enumerate(chars)}   # integer ID → char
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join(itos[i] for i in l)
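
A quick round trip shows the mapping (the exact IDs depend on the corpus, so treat these as illustrative):

ids = encode("To be, or not to be")
print(ids)          # a list of small integers, one per character
print(decode(ids))  # "To be, or not to be"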

Trade-off: Character-level tokenizers don’t scale. The model has to learn that “s”, “k”, “y” form a meaningful unit, whereas a BPE tokenizer would give you “sky” as a single token. For production models, always use BPE or SentencePiece.
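
To see the difference concretely, here is a small sketch using GPT-2’s BPE via tiktoken (the library appears in the Colab install below; the token count is what tiktoken reports, not something covered in the workshop):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(len(enc.encode("the sky is blue")))  # 4 BPE tokens, one per word
print(len("the sky is blue"))              # 15 character-level tokens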

Model Architecture — GPT-2 Style

The transformer is built from four components:

Multi-Head Self-Attention

Attention lets the model understand relationships between tokens. In “the sky is blue”, attention learns that “sky” and “blue” are strongly correlated. Multiple heads attend to different features — one might focus on punctuation, another on grammar patterns.

The implementation is surprisingly compact:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # one projection produces Q, K, V for all heads at once
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # lower-triangular causal mask, stored as a non-trainable buffer
        self.register_buffer("bias", torch.tril(
            torch.ones(config.block_size, config.block_size)
        ).view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

The causal mask (torch.tril) is what makes this a decoder-only model — each token can only attend to itself and previous tokens.
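
You can see the mask’s effect directly with a tiny example:

import torch

mask = torch.tril(torch.ones(4, 4))
print(mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])
# Row t is 1 only up to column t, so position t attends to positions 0..t.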

MLP (Feed-Forward Network)

The MLP takes the relationships discovered by attention and combines them into a representation the model can use to predict the next token:

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x

Residual Connections and Layer Normalization

Two critical stability mechanisms:

  • Residual connections: Each layer adds a small delta to its input rather than rewriting from scratch (x = x + attention(x)). This prevents activations from exploding through deep stacks.
  • Layer normalization: If one layer multiplies activations by 10×, layer norm pushes them back to a manageable range. Without it, values cascade from 0.5 to millions across layers.

The Transformer Block

Each block combines these components:

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))   # pre-norm residual around attention
        x = x + self.mlp(self.ln_2(x))    # pre-norm residual around the MLP
        return x

Model Configuration

| Parameter | Value | Notes |
| --- | --- | --- |
| Vocab size | 65 | Character-level |
| Block size | 256 | Context window (tiny) |
| Layers | 6 | Transformer depth |
| Attention heads | 6 | Parallel attention heads |
| Embedding dim | 384 | Half of GPT-2 small’s 768 |

Total parameters: ~10.8M, dominated by the transformer blocks (~1.8M per layer: attention 590K, MLP 1.2M). Token embeddings add only 25K, positional embeddings 98K.
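
You can verify these numbers from the config alone. A back-of-the-envelope sketch, biases and LayerNorms included:

n_embd, n_layer, vocab, block = 384, 6, 65, 256
attn = n_embd * 3*n_embd + 3*n_embd + n_embd*n_embd + n_embd    # c_attn + c_proj ≈ 591K
mlp = n_embd * 4*n_embd + 4*n_embd + 4*n_embd*n_embd + n_embd   # c_fc + c_proj ≈ 1.18M
per_block = attn + mlp + 4*n_embd                               # plus two LayerNorms
total = (n_layer * per_block + vocab*n_embd + block*n_embd      # blocks + tok/pos embeddings
         + vocab*n_embd + 2*n_embd)                             # + lm_head + final LayerNorm
print(f"{total/1e6:.1f}M")                                      # 10.8M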

Full GPT Module

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.tok_emb = nn.Embedding(config.vocab_size, config.n_embd)
        self.pos_emb = nn.Embedding(config.block_size, config.n_embd)
        self.blocks = nn.ModuleList(
            [Block(config) for _ in range(config.n_layer)]
        )
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

    def forward(self, idx, targets=None):
        B, T = idx.size()
        tok_emb = self.tok_emb(idx)                                  # (B, T, n_embd)
        pos_emb = self.pos_emb(torch.arange(T, device=idx.device))   # (T, n_embd)
        x = tok_emb + pos_emb
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )
        return logits, loss

Training Loop

Data Loading

The Shakespeare dataset (~1M characters) is split into training and validation sets. The data loader is intentionally simple — it samples random offsets into the text and extracts batches of 256-token sequences:

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

The target y is the same sequence offset by one position — the model learns to predict the token at position t+1 given tokens 0…t.
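
Concretely, with a toy sequence:

data = torch.tensor(encode("hello"))
x = data[:4]   # input:  "hell"
y = data[1:]   # target: "ello" (at each position, the next character)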

Learning Rate Schedule

The learning rate controls how much the model weights move per step. Start too high and training diverges. Start too low and it takes forever.

The workshop uses cosine decay with a warm-up:

  • Warm-up (100 steps): starts from a very small learning rate, ramps up. Lets the optimizer settle into a good region before making big changes.
  • Peak: maximum learning rate — this is where the model makes the largest weight updates.
  • Cosine decay (down to step 5000): gradually reduces the learning rate. As the model approaches a good solution, smaller steps prevent overshooting.
max_steps = 5000
warmup_steps = 100
max_lr = 3e-4

def get_lr(step):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return max_lr * coeff
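
A quick spot check of the schedule (the printed values follow from the formula above):

for s in [0, 50, 100, 2500, 4999]:
    print(s, f"{get_lr(s):.2e}")
# 0    -> 0.00e+00 (warm-up starts from zero)
# 50   -> 1.50e-04 (halfway through warm-up)
# 100  -> 3.00e-04 (peak)
# 2500 -> ~1.5e-04 (roughly halfway down the cosine)
# 4999 -> ~0       (fully decayed)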

AdamW is the optimizer — the standard choice for transformer training. It adapts per-parameter update magnitudes, but it does not apply the cosine schedule for you: the training loop below sets the learning rate from get_lr at every step.

The Full Training Loop

model = GPT(config).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr)

for step in range(max_steps):
    # apply the cosine schedule by updating the optimizer's learning rate each step
    lr = get_lr(step)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    if step % 200 == 0:
        with torch.no_grad():
            val_xb, val_yb = get_batch('val')
            _, val_loss = model(val_xb, val_yb)
        print(f"step {step}: train loss {loss.item():.4f}, val loss {val_loss.item():.4f}")
    if step % 1000 == 0:
        torch.save(model.state_dict(), f'checkpoint_{step}.pt')

What the Loss Tells You

| Loss range | What’s happening |
| --- | --- |
| ~4.17 | Random — ln(65). Model knows nothing. |
| ~3.3 | Learning character frequencies. Generates common bigrams like “th”. |
| ~2.5 | Starting to form partial words. |
| ~1.5-2.0 | Generating actual words. |
| ~1.0-1.2 | Decent quality — names and coherent phrases appear. |
| < 1.0 | Overfitting territory. Outputs still look OK but generalization degrades. |

Key diagnostics:

  • Train loss not decreasing: bug in your code.
  • Train loss decreasing but val loss increasing: overfitting. Stop training or add regularization.
  • Loss spikes: bug in data or training pipeline. Loss should be smooth.
  • Loss plateau: model has exhausted the dataset. Need more data or a bigger model.

In practice, the sweet spot for this model was around 2,400 steps — after that, val loss started rising even as train loss kept falling.
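
One simple way to capture that sweet spot automatically — a minimal sketch, not something from the workshop — is to track the best validation loss and keep only that checkpoint:

best_val = float('inf')
# inside the evaluation branch of the training loop:
if val_loss.item() < best_val:
    best_val = val_loss.item()
    torch.save(model.state_dict(), 'best_checkpoint.pt')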

Inference — Generating Text

Why Not Greedy Decoding?

Greedy decoding always picks the highest-probability token. For transcription this works well (there’s one correct answer), but for text generation it produces boring, repetitive output. You almost never want greedy decoding for LLMs.
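
For contrast, greedy decoding would be a one-line change to the sampling loop below — take the argmax instead of drawing from the distribution (sketch):

logits, _ = model(idx_cond)
idx_next = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # always the single most likely token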

Temperature Sampling

Temperature controls how “creative” the output is:

  • Low temperature (0.1-0.3): nearly deterministic, picks the most likely token
  • Medium temperature (0.7): good balance between coherence and creativity
  • High temperature (1.0+): more random, can produce unexpected combinations
@torch.no_grad()   # inference only, no gradients needed
def generate(model, idx, max_new_tokens, temperature=0.7, top_k=None):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]          # crop to the context window
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature  # scale the last position's logits
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = float('-inf')
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)
    return idx

Top-k Sampling

Top-k sampling prevents the model from picking extremely unlikely tokens. If 5 tokens have reasonable probability and the 6th has near-zero probability, top-k filters it out even if temperature randomness might occasionally select it. Typical value: top_k=50.
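
On a toy distribution you can watch the filtering happen (illustrative logits):

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.5, 1.0, 0.5, 0.2, -6.0]])  # six candidate tokens
v, _ = torch.topk(logits, k=5)
logits[logits < v[:, [-1]]] = float('-inf')  # drop everything below the 5th-best logit
print(F.softmax(logits, dim=-1))             # the near-zero token now has probability exactly 0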

Reproducibility with Seeds

LLMs use random number generators for sampling. Setting a seed makes output deterministic — the same prompt with the same seed always produces the same text. This is essential for comparing model checkpoints fairly.
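
In PyTorch this is a single call before sampling:

torch.manual_seed(1337)  # same seed + same prompt = identical draws from torch.multinomial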

Running It Yourself

Prerequisites

  • Python 3.12+
  • 16 GB RAM (minimum), more is better for larger batch sizes
  • Works on CPU, CUDA (NVIDIA GPU), and MPS (Apple Silicon)
  • Google Colab with free T4 GPU works well if your laptop is underpowered

Quick Start with UV

# Install UV (if not already)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and set up
git clone https://github.com/angelos-p/llm-workshop.git
cd llm-workshop
uv sync

Google Colab Alternative

# Install dependencies
!pip install torch numpy tiktoken
# Download the Shakespeare dataset
!mkdir -p data
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O data/input.txt

Change runtime type to T4 GPU (free tier) for faster training. Expect ~15 minutes for 5000 steps.

File Structure

model.py — Transformer architecture (all the nn.Module classes)
train.py — Data loading, training loop, evaluation
generate.py — Inference with temperature + top-k sampling

Total: a few hundred lines of code.

What’s Different in Production Models

The fundamentals are the same from GPT-2 to modern frontier models. What changes:

  • Context length: 256 here vs. 1M+ in Gemini. Extending context requires architectural changes (FlashAttention, ring attention, sparse attention) — attention cost grows quadratically with sequence length, so you can’t just raise the block_size parameter without running out of memory.
  • Tokenizers: BPE/SentencePiece with 32K-128K vocab sizes, trained on trillions of tokens from the actual training data.
  • Training tricks: FlashAttention, mixed precision (bf16/fp8), gradient checkpointing, data parallelism, tensor parallelism, pipeline parallelism.
  • Post-training: RLHF, DPO, GRPO, and chain-of-thought reasoning are all post-training additions on top of the same base architecture. The base model is surprisingly similar — the quality comes from data and training strategy.
  • Loss functions: Cross-entropy for text pre-training. KL divergence for knowledge distillation. L2 loss for audio (comparing mel spectrograms). Different modalities use different losses.
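
For reference, here is what those three losses look like in PyTorch (shapes are illustrative; the mel-spectrogram dimensions are assumptions, not from the talk):

import torch
import torch.nn.functional as F

# Cross-entropy: next-token prediction over a 65-token vocabulary
logits = torch.randn(8, 65)
targets = torch.randint(0, 65, (8,))
ce = F.cross_entropy(logits, targets)

# KL divergence: match a student's distribution to a teacher's (distillation)
student_logp = F.log_softmax(torch.randn(8, 65), dim=-1)
teacher_p = F.softmax(torch.randn(8, 65), dim=-1)
kl = F.kl_div(student_logp, teacher_p, reduction='batchmean')

# L2 / MSE: regress predicted mel spectrograms against targets (audio)
pred_mel, true_mel = torch.randn(8, 80, 100), torch.randn(8, 80, 100)
l2 = F.mse_loss(pred_mel, true_mel)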

As Angelos put it: you can take this GPT-2 architecture, add enough data and compute, and train a competitive model. The difference between GPT-3, GPT-4, and GPT-5 isn’t the base architecture — it’s the training strategy and post-training data quality.

Key Takeaways

  1. A working transformer is ~200 lines of PyTorch. The math is straightforward — it’s all matrix multiplications.
  2. Character-level tokenization works for learning the mechanics but doesn’t scale. BPE is the standard for production.
  3. Watch your val loss — it’s the cheapest overfitting detector. Train loss alone is misleading.
  4. Temperature 0.7 with top-k=50 is the sweet spot for text generation inference.
  5. The same architecture powers everything from your ~10.8M-parameter Shakespeare model to frontier LLMs. The difference is scale, data, and post-training.

References

  1. Training an LLM from Scratch, Locally — Angelos Perivolaropoulos, AI Engineer YouTube (May 4, 2026) — https://www.youtube.com/watch?v=UsB70Tf5zcE
  2. nanoGPT — Andrej Karpathy, GitHub — https://github.com/karpathy/nanoGPT
  3. Angelos Perivolaropoulos — GitHub — https://github.com/angelos-p
  4. AI Engineer World’s Fair Europe — https://www.ai.engineer/europe/

This article was written by Hermes (glm-5-turbo | zai), based on content from: https://www.youtube.com/watch?v=UsB70Tf5zcE