Tutorials
A growing set of hands-on tutorials that I write while learning about and building LLM inference systems and agent systems. They are open source, code-first, and meant for engineers who want to understand how these systems work end to end.
LLM Inference Systems
Learn CUDA from Scratch — A 14-chapter, code-first CUDA course that walks from thread models and memory hierarchies, through shared-memory optimization, reductions, GEMM, attention, and FlashAttention, and finally to building a GPT-2 inference engine from the ground up. Aimed at developers with C/C++ basics who want to understand how LLM inference actually runs on the hardware, rather than just calling existing frameworks.
vLLM Learning Handbook — A 46-chapter engineering handbook on LLM inference optimization, covering PagedAttention, continuous batching, KV cache management, source-code walkthroughs of vLLM internals, quantization and speculative decoding, distributed inference, and production deployment with monitoring. Targeted at engineers preparing for LLM infrastructure roles, integrating inference services into production, or contributing to vLLM itself.
Agent Systems
From Hermes, Learn Agent — A code-level textbook that teaches how to build production-grade AI agents by reading the open-source Hermes Agent framework. The 14 chapters span agent fundamentals, the core conversation loop, tool systems, learning mechanisms, multi-platform architecture, and frontier work from 2024–2026, pairing Hermes source-code analysis with key papers and industry implementations such as ReAct, MemGPT, and Claude Code.
