petite-vllm

Building an LLM Serving Engine from Scratch

May 24, 2026 · 2 min read

Why am I building this?

I use vLLM daily in my work as an ML Performance Engineer, and I increasingly find that I would benefit from a deeper understanding of its internals. Reading the code is helpful, but for me, the real learning happens when I start implementing things from scratch :).

petite-vllm is the result. It's a from-scratch LLM serving engine, built one concept at a time, using Qwen3-0.6B as the main example model. The project is set up as a series of parts; I used Claude to help structure the lesson plan and scaffold some of the code. petite-vllm is heavily influenced by the nano-vllm project, but is intentionally broken down into byte-sized chunks that build up to each concept progressively.

Structure

Each part builds incrementally on the previous one, and I'll be sharing the code and a writeup of what I learned as a post for each. The planned modules so far are:

Autoregressive Generation: Build the Qwen3 model architecture by hand, load pretrained weights, and generate tokens one at a time. Painfully slow, and that's the point.
The KV Cache: Implement KV caching. We'll split generation into prefill and decode, and cache the K and V tensors so we stop recomputing the full sequence every step.
Paged Attention: We'll build up from a flat KV cache implementation to the block-based KV cache that vLLM is known for.
Continuous Batching: Enable sequences to enter and exit the batch independently. Once a sequence completes, it frees up a slot for a new sequence to enter the batch.
API Server: We now have a fully functioning, minimal serving engine. We'll wrap it in a fast API layer and explore ideas such as streaming, concurrency, and scheduling.
Triton Kernels & CUDA Graphs: This is where we start playing with GPUs. We'll use Triton to write custom kernels to improve our serving engine's performance.
Model Parallelism: From one GPU to many (or, as many as I can afford :D)! We'll grab a bigger model and take a look at different techniques to shard the model across multiple GPUs.
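
To make the continuous batching idea from the list above a little more concrete, here is a toy sketch of the slot-reuse logic in pure Python. It is only a simulation under my own assumptions, not code from petite-vllm or vLLM: there is no model, each "request" is just a name plus the number of tokens it wants to generate, and the names `run_continuous_batching` and `max_batch_size` are made up for illustration.

```python
from collections import deque


def run_continuous_batching(requests, max_batch_size):
    """Simulate decode steps for a toy continuous-batching scheduler.

    requests: dict mapping a request name to the number of tokens it generates.
    Returns (schedule, finished_at), where schedule[i] is the sorted list of
    request names in the batch at step i, and finished_at maps each request
    to the step at which it produced its last token.
    """
    waiting = deque(requests.items())  # requests not yet admitted, in arrival order
    running = {}                       # name -> tokens still to generate
    finished_at = {}
    schedule = []
    step = 0

    while waiting or running:
        # Admit waiting requests into any free slots before this step runs.
        while waiting and len(running) < max_batch_size:
            name, num_tokens = waiting.popleft()
            running[name] = num_tokens

        schedule.append(sorted(running))

        # One decode step: every running request emits exactly one token.
        for name in list(running):
            running[name] -= 1
            if running[name] <= 0:
                # The request is done; its slot is free for the next step.
                del running[name]
                finished_at[name] = step

        step += 1

    return schedule, finished_at


schedule, finished_at = run_continuous_batching({"a": 2, "b": 5, "c": 3}, max_batch_size=2)
for i, batch in enumerate(schedule):
    print(f"step {i}: {batch}")
```

With a batch size of 2, request "c" takes over the slot freed by "a" while "b" is still decoding, so all three requests finish in 5 steps. If "c" had to wait for the whole first batch to finish, the same work would take 8 steps in this toy model.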