Luce Megakernel: CUDA Fusion Beats Apple Silicon Efficiency

A single CUDA kernel covering all 24 layers of Qwen 3.5-0.8B delivers 1.87 tok/J on an RTX 3090, matching the efficiency of Apple's M5 Max at 2x the throughput.

Latest Articles

Hermes Agent: Installation Deep Dive and Optimization

12 min read

A practical walkthrough of installing Hermes Agent by Nous Research — covering the installer script internals, PyTorch CPU optimization, Bun runtime compatibility, RL training vs. built-in learning, and setting up CLI skills for Tavily, Context7, and Beads.

Hermes Agent: Self-Improving Autonomous AI Agent

9 min read

An open-source autonomous agent with a built-in learning loop that creates skills from experience, improves them during use, and remembers across sessions. Unlike typical chatbots or coding copilots, Hermes runs on your server, integrates with messaging platforms, and gets smarter the longer you use it.