technical
petite-vllm · llm-serving · vllm

petite-vllm

Building an LLM Serving Engine from Scratch

Why am I building this?

I use vLLM daily in my work as an ML Performance Engineer, and I increasingly find that I would benefit from a deeper understanding of its internals. Reading the code helps, but for me the real learning happens when I implement things from scratch :).

petite-vllm is the result. It's a from-scratch LLM serving engine, built one concept at a time, using Qwen3-0.6B as the main example model. The project is set up as a series of lessons; I used Claude to help structure the lesson plan and scaffold some of the code. petite-vllm is heavily influenced by the nano-vllm project, but intentionally broken down into byte-sized chunks that progressively build up to each concept.

The lessons

Each lesson builds incrementally on the previous one, and I'll share the code and a short writeup of what I learned in a post for each.

  1. Autoregressive Generation — Build the Qwen3 model architecture by hand, load pretrained weights, generate tokens one at a time. Painfully slow, and that's the point.

  2. The KV Cache — Split generation into prefill and decode. Cache K and V tensors so we stop recomputing the full sequence every step. The single most important optimization in LLM serving.

  3. Paged KV Cache — Treat KV cache memory like virtual memory. Allocate in fixed-size blocks, map logical positions to physical slots. vLLM's key innovation.

  4. Continuous Batching — Sequences enter and exit the batch independently. No more waiting for the slowest request. Add chunked prefill so long prompts don't block decode.

  5. Triton Kernels & CUDA Graphs — Write custom GPU kernels in Python. Fused RMSNorm, paged attention decode, CUDA graph capture. Where the performance engineering gets real.

  6. Roofline Analysis & Profiling — No new features. Just the math: arithmetic intensity, ridge points, compute-bound vs memory-bound. The quantitative reasoning interviewers expect you to do live.

  7. Prefix Caching — Reuse KV cache across requests with shared prefixes. Content-addressable blocks with reference counting.

  8. Speculative Decoding — Draft tokens with a small model, verify in parallel with the big one. The acceptance/rejection math, and when it helps vs hurts.

  9. Structured Output — Constrain generation to valid JSON via finite state machines and logit masking.

  10. Model Parallelism — Tensor parallelism (shard within a layer) and pipeline parallelism (shard across layers). Column-parallel, row-parallel, all-reduce.

  11. Quantization — INT8 weight-only and weight+activation quantization from scratch. Per-channel scaling, calibration, and what breaks when you quantize too aggressively.

  12. Gated Delta Networks — Beyond transformers. Qwen 3.5/3.6 replaces the KV cache in its Gated DeltaNet layers with a fixed-size recurrent state. How that changes everything about serving.

  13. API Server — Wrap it all in a FastAPI server with SSE streaming. The capstone.
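
To give a flavor of lesson 3, here is a minimal sketch of the "logical positions to physical slots" idea. It is not the petite-vllm or vLLM code: the class name `BlockAllocator`, its methods, and the tiny block size are all hypothetical, chosen only to make the mapping visible. Each sequence gets a block table, and a new physical block is claimed only when the sequence crosses a block boundary.

```python
class BlockAllocator:
    """Toy paged KV cache bookkeeping (illustrative names, not vLLM's API)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def slot(self, seq_id, pos):
        """Map a logical token position to a physical slot index."""
        block = self.block_tables[seq_id][pos // self.block_size]
        return block * self.block_size + pos % self.block_size

    def append_token(self, seq_id):
        """Reserve room for one more token and return its physical slot."""
        table = self.block_tables.setdefault(seq_id, [])
        pos = self.seq_lens.get(seq_id, 0)
        if pos % self.block_size == 0:  # current block is full (or none yet)
            if not self.free_blocks:
                raise RuntimeError("out of KV cache blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = pos + 1
        return self.slot(seq_id, pos)

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


allocator = BlockAllocator(num_blocks=4, block_size=2)
slots_a = [allocator.append_token("a") for _ in range(3)]
slot_b = allocator.append_token("b")
print(slots_a, slot_b, allocator.block_tables)
```

Because sequence "b" claims a block while "a" is still growing, the blocks owned by "a" need not be contiguous in physical memory, which is exactly what the block table hides from the attention code.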
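
The roofline math from lesson 6 also fits in a few lines. This is a rough sketch under stated assumptions: it models a single linear layer as one matmul, counts only weight and activation traffic, and uses made-up hardware numbers (100 TFLOP/s peak compute, 1 TB/s peak bandwidth) rather than the specs of any real GPU. The function names are my own.

```python
def arithmetic_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for y = x @ W, with x: (batch, d_in), W: (d_in, d_out)."""
    flops = 2 * batch * d_in * d_out  # one multiply and one add per weight per row
    bytes_moved = bytes_per_elem * (
        d_in * d_out      # read the weights
        + batch * d_in    # read the input activations
        + batch * d_out   # write the output activations
    )
    return flops / bytes_moved


def ridge_point(peak_flops, peak_bandwidth):
    """Intensity at which a kernel stops being memory-bound."""
    return peak_flops / peak_bandwidth


def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline model: performance is capped by compute or by bandwidth."""
    return min(peak_flops, intensity * peak_bandwidth)


# Hypothetical accelerator, not a real part.
PEAK_FLOPS = 100e12  # FLOP/s
PEAK_BW = 1e12       # bytes/s

ridge = ridge_point(PEAK_FLOPS, PEAK_BW)
for batch in (1, 512):
    ai = arithmetic_intensity(batch, 1024, 1024)
    bound = "compute-bound" if ai >= ridge else "memory-bound"
    print(f"batch={batch}: intensity={ai:.2f} FLOP/byte, {bound}")
```

With these numbers, a batch of 1 (the shape of a single decode step) lands at roughly 1 FLOP/byte, far below the ridge point of 100, while a batch of 512 reaches 256 FLOP/byte and crosses into compute-bound territory. That gap is the quantitative reason batching matters so much for decode.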
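
For lesson 8, the acceptance/rejection step can be sketched as follows. This assumes the standard speculative sampling rule: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the target model's distribution and q is the draft model's, and on rejection resample from the normalized positive part of p - q. The function name `verify_token` and the plain-list distributions are illustrative only; a real engine would do this on batched tensors.

```python
import random


def verify_token(draft_token, p_target, q_draft, rng):
    """Verify one drafted token; returns (accepted, token_to_emit).

    p_target and q_draft are probability lists over the same vocabulary.
    Assumes q_draft[draft_token] > 0, which holds whenever the token was
    actually sampled from the draft distribution.
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    if rng.random() < min(1.0, p / q):
        return True, draft_token
    # Rejected: resample from the residual max(0, p - q).
    # rng.choices normalizes the weights for us.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    token = rng.choices(range(len(residual)), weights=residual)[0]
    return False, token


rng = random.Random(0)
p_target = [0.7, 0.2, 0.1]
q_draft = [0.3, 0.6, 0.1]

draft = rng.choices(range(3), weights=q_draft)[0]
accepted, token = verify_token(draft, p_target, q_draft, rng)
print(f"drafted {draft}, accepted={accepted}, emitted {token}")
```

When the draft model agrees closely with the target, most tokens are accepted and several can be emitted per target forward pass. When it disagrees, the rejections waste the draft work, which is the "helps vs hurts" trade-off.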