Local-ai Articles | Learning thru AI

Gemma 4 12B: MTP Speculative Decoding and RAG for Faster Local Inference

Jun 14, 2026 · 5 min read

How Gemma 4 12B combines encoder-free multimodal design, MTP speculative decoding, and RAG to run OCR and document Q&A on consumer hardware.

local-ai rag

Nex-N2 Agentic Models — Benchmarks, Nex-AGI Origins, and Running Locally

Jun 12, 2026 · 8 min read

A deep dive into the Nex-N2 model family from China's SII-backed Nex-AGI alliance: benchmarks against GPT 5.5 and Opus 4.7, who is behind the startup, and how to run the 35B mini variant on an RTX 5090.

ai local-ai

BeeLlama.cpp + RTX 5090: 32 GB of DFlash Sweet Spot

Jun 9, 2026 · 10 min read

BeeLlama.cpp is a llama.cpp fork that adds DFlash speculative decoding, TurboQuant/TCQ, MTP, and more. This article covers how it works, the RTX 5090's 32 GB advantage, and what we can expect in tok/s on the latest Blackwell hardware.

local-ai ai

Qwen3.6 27B for Local Coding on RTX 5090 — User Experience and Setup

Jun 8, 2026 · 1 min read

Real-world tests of Qwen3.6 27B on RTX 5090 with 32 GB VRAM — benchmarks, llama.cpp/vLLM/sglang setups, NVFP4 + MTP tuning recipe, and the CUDA 13.2 gotcha that produces gibberish.

local-ai ai

Gemma 4 12B QAT for Local Coding — User Experience and Setup

Jun 8, 2026 · 9 min read

Real-world tests of Gemma 4 12B Quantization-Aware Training for coding tasks — benchmarks, llama.cpp/vLLM/sglang setups, VRAM requirements, and the critical pitfalls nobody mentions.

local-ai ai

Qwen3.6 27B: From 20 t/s to 184 t/s — Full Optimization Pipeline

Jun 7, 2026 · 14 min read

Comprehensive benchmark and optimization of Qwen3.6-27B on RTX 4090 through quantization, MTP speculative decoding, DFlash diffusion-based acceleration, DDTree branching, and TurboQuant KV cache compression.

local-ai youtube

Local LLMs: Fixing Infinite Repetition Loops in Open-Weight Models

Jun 7, 2026 · 4 min read

A deep dive into why local language models get trapped in infinite repetition loops and how to fix them using optimized sampler configurations.

local-ai

Small LLMs with RAG, Context7, and Agent Memory: Building a Local Coding Agent

Jun 5, 2026 · 8 min read

How to build a capable local coding agent using small LLMs (Qwen3 8B, Gemma3 12B) augmented with Context7 for up-to-date documentation, semantic RAG for project context, and agent memory frameworks for persistent knowledge.

ai local-ai

llama.cpp: Run a 35B MoE Model on 6GB VRAM — 5 Flags That Matter

Jun 2, 2026 · 5 min read

How to run Qwen 3.6 35B-A3B on a GTX 1060 with just 6GB VRAM using llama.cpp, MoE offloading, and five critical flags — boosting speed from 3 to 17 tokens/sec.

local-ai youtube

Pi Coding Agent 0.78.0 — Installation Analysis

May 30, 2026 · 12 min read

A deep dive into my Pi Coding Agent setup: models, extensions, skills, agents, and configuration.

local-ai self-hosting

DFlash Speculative Decoding: 600 Tokens/sec on Single RTX 5090

May 25, 2026 · 8 min read

How block diffusion speculative decoding with DFlash, vLLM, and Gemma 4 26B MoE achieves 600 tokens per second on consumer GPU hardware.

local-ai youtube

Optimizing Local Qwen 3.6 27B with DFlash on an RTX 5090

May 24, 2026 · 5 min read

A complete guide to tuning beellama.cpp for maximum context, stable tool calling, and high-speed speculative decoding on a single 32GB GPU.

local-ai ai

GPU and Inference Engine Selection: vLLM, SGLang, TGI, and NIM Compared

May 19, 2026 · 5 min read

Practical guide to choosing GPUs and inference engines for LLM deployment — covering quantization, benchmarking vLLM vs SGLang vs TGI vs NIM, and cost analysis across A40 through H100.

local-ai youtube

Running Qwen3-Next-80B-A3B on Limited VRAM with Selective MoE Offloading

May 6, 2026 · 7 min read

Run the 80B MoE Qwen3-Next locally using llama.cpp with selective FFN layer offloading to CPU. Unsloth UD-Q4_K_XL quantization plus the regex-based -ot flag let you maximize GPU usage while keeping MoE expert layers in system RAM.

local-ai youtube

DFlash on RTX 3090: 207 tok/s Qwen3.5-27B with Speculative Decoding

May 6, 2026 · 7 min read

Run Qwen3.5-27B at 3.43x autoregressive speed on a single RTX 3090. Lucebox's DFlash port brings block-diffusion speculative decoding to GGUF — build, download weights, and start generating in under 20 minutes.

local-ai speculative-decoding

Gemma 4 MTP Drafters: Speculative Decoding for Local LLMs

May 6, 2026 · 4 min read

Google released official MTP drafter models for Gemma 4. A small companion model guesses tokens ahead and the big model verifies them, giving the same quality at nearly 3x the speed on the same hardware.

local-ai youtube

llama.cpp: Running 35B MoE on 6GB VRAM

May 5, 2026 · 1 min read

Five working tricks + one failed trick + one upcoming trick for running Qwen 3.6 35B on an 8-year-old GTX 1060.

local-ai youtube

Non-GPU AI Accelerators: The Post-NVIDIA Landscape

May 5, 2026 · 12 min read

A comprehensive survey of non-NVIDIA AI chips available today — TPUs, NPUs, custom ASICs, and wafer-scale engines — from AWS Trainium3 to Cerebras WSE-3, Google TPU v6 Trillium, Korea's NPU startups, and optical interconnect upstarts.

ai local-ai

Turbovec + OpenClaw + Ollama: Local RAG Agent with 8x TurboQuant Compression

Apr 22, 2026 · 4 min read

Turbovec achieves 8x memory compression for RAG embeddings via TurboQuant quantization, enabling fully local agentic workflows with OpenClaw and Ollama on consumer hardware.

local-ai ai

Tiny Language Models: Fast Local Models with Unsloth and Outlines

Apr 20, 2026 · 4 min read

A practical walkthrough of using structured synthetic data, Unsloth fine-tuning, and a simple harness to turn a tiny base model into a fast local specialist.

local-ai youtube

Luce Megakernel: CUDA Fusion Beats Apple Silicon Efficiency

Apr 17, 2026 · 6 min read

A single CUDA kernel for all 24 layers of Qwen 3.5-0.8B delivers 1.87 tok/J on an RTX 3090, matching Apple's M5 Max at 2x the throughput.

local-ai youtube

Local AI in the Wild: What Real Users Are Actually Running

Apr 15, 2026 · 12 min read

54 comments from developers running Gemma 4, Qwen 3.5, and other local models — the hardware, the benchmarks, the frustrations, and the wins.

local-ai youtube

Gemma 4 for Local OCR: Self-Hosted Document Processing with Ollama and TurboQuant

Apr 9, 2026 · 8 min read

How to use Gemma 4 as a local OCR engine — processing images and PDFs through Ollama with vision models, no cloud APIs needed. Covers the architecture, TurboQuant's impact on long-context document processing, and a practical Python implementation.

local-ai youtube

Z-Image-Turbo and Flux 2 Klein 4B: Local Image Generation on AMD iGPU and CPU with stable-diffusion.cpp

Apr 8, 2026 · 16 min read

Two of the newest distilled diffusion models — Z-Image-Turbo and Flux 2 Klein 4B — both run locally on AMD integrated graphics and CPU using stable-diffusion.cpp. No NVIDIA GPU required. We benchmark both on a Ryzen 5 PRO 4650U and show how they share the same text encoder to save disk space.

local-ai

RotorQuant and IsoQuant: Fixing Turbo Quant's Prefill Bottleneck with Clifford Algebra

Apr 7, 2026 · 7 min read

How RotorQuant replaces Turbo Quant's expensive 128x128 matrix rotation with Clifford algebra rotors — 44x fewer parameters, 10-19x faster on CUDA, matching attention fidelity on real models.

local-ai youtube

Google Turbo Quant: Theory, Dense vs MoE Context, and llama.cpp Benchmarks

Apr 7, 2026 · 7 min read

A deep dive into Google's Turbo Quant KV cache compression — from the theory of 3-bit compression vs 4-bit, through dense vs MoE context scaling experiments, to a full llama.cpp benchmark with FP16, Q4, and Turbo Quant head-to-head.

local-ai youtube

llama.cpp: Running LLMs on AMD Vega iGPU with Vulkan

Apr 7, 2026 · 9 min read

Getting llama.cpp to work on an AMD Ryzen 5 PRO 4650U with integrated Vega graphics — no NVIDIA, no CUDA, no ROCm. Just Mesa RADV and the Vulkan backend.

local-ai

FastSD CPU: Run Stable Diffusion on Any CPU — No GPU Required

Apr 7, 2026 · 7 min read

Generate images locally using Stable Diffusion on nothing but your CPU. FastSD CPU uses Latent Consistency Models and OpenVINO to produce 512x512 images in under a second on a modern processor — no $5,000 GPU needed.

local-ai

Local AI Hardware Guide: Why VRAM Matters More Than GPU Speed

Mar 28, 2026 · 4 min read

A practical guide to building local AI systems focused on VRAM—the key bottleneck for running AI models locally at usable speeds.

local-ai youtube

Run Data Center AI Accelerators in Your Workstation: A DIY Guide

Mar 23, 2026 · 8 min read

How to repurpose a Tesla V100 SXM2 AI accelerator from a DGX server into your home workstation for running local LLMs at a fraction of GPU costs.

local-ai