microgpt - GPT Training and Inference Implemented in 200 Lines of Pure Python | GeekNews
Key Points
1. microgpt is a 200-line pure Python implementation of a complete GPT model, created by Karpathy as an "art project" to distill the core algorithmic essence of large language models down to its irreducible minimum.
2. The self-contained project includes all fundamental components: a character-level tokenizer, a custom scalar autograd engine (the `Value` class), a small GPT-2-like Transformer architecture, and an Adam optimizer, and it can train and run inference on a dataset of 32,000 names.
3. While demonstrating the full algorithmic loop from data ingestion to new-name generation, microgpt highlights that the fundamental mathematics are identical to those of massive production LLMs, which differ mainly in scale, engineering optimizations (e.g., tensor-based autograd, larger models, BPE tokenizers, advanced optimization), and post-training processes such as SFT and RLHF.
MicroGPT is a pure Python implementation of the core GPT algorithm, condensed into approximately 200 lines of code without external dependencies. Developed by Andrej Karpathy, it serves as an educational art project to demystify Large Language Models (LLMs) by presenting their "irreducible minimum" algorithmic essence. It builds upon previous projects like micrograd, makemore, and nanogpt, consolidating their concepts into a single file that includes dataset handling, tokenization, an autograd engine, a GPT-2-like Transformer architecture, an Adam optimizer, and full training and inference loops.
Core Methodology and Technical Details:
- Dataset and Tokenization:
- Dataset: MicroGPT uses a simple dataset of 32,000 names, with each name treated as a "document." The model learns statistical patterns within these names to generate new, plausible names.
- Tokenization: A character-level tokenizer is employed. Each unique lowercase letter (a-z) is assigned an integer ID, along with a special Beginning of Sequence (BOS) token. A name like "emma" is tokenized as `[BOS, e, m, m, a, BOS]`. The total vocabulary size is 27 (26 letters + 1 BOS).
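The tokenization scheme above can be sketched in a few lines of pure Python. This is an illustrative reconstruction, not microgpt's actual code; the identifiers (`stoi`, `encode`, etc.) are hypothetical.

```python
# Hypothetical sketch of the character-level tokenizer described above.
chars = [chr(ord('a') + i) for i in range(26)]    # 'a'..'z'
BOS = 0                                           # special Beginning-of-Sequence token
stoi = {ch: i + 1 for i, ch in enumerate(chars)}  # 'a'->1, ..., 'z'->26
itos = {i: ch for ch, i in stoi.items()}
vocab_size = 27                                   # 26 letters + 1 BOS token

def encode(name):
    """Wrap a name in BOS tokens: 'emma' -> [BOS, e, m, m, a, BOS]."""
    return [BOS] + [stoi[c] for c in name] + [BOS]

def decode(tokens):
    """Drop BOS tokens and map IDs back to characters."""
    return ''.join(itos[t] for t in tokens if t != BOS)
```

With this mapping, `encode("emma")` yields `[0, 5, 13, 13, 1, 0]`, matching the `[BOS, e, m, m, a, BOS]` pattern described above.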
- Autograd (Automatic Differentiation):
- The core of training is computing gradients, implemented via a custom `Value` class.
- Each `Value` object encapsulates a scalar floating-point number (`.data`) and maintains a record of the operation that produced it (`_op`), its input `Value` objects (`_children`), and the partial derivatives of that operation with respect to its inputs (`_local_grads`). Supported operations include addition, multiplication, exponentiation, logarithm, exponential, and ReLU.
- The `backward()` method is called on the final scalar loss `Value`. It initializes the loss's gradient to 1 (since d(loss)/d(loss) = 1) and then traverses the computational graph in reverse topological order. At each `Value` node, it applies the chain rule, multiplying the incoming gradient (from its parent in the backward pass) by its `_local_grads` to compute gradients for its `_children`. Gradients are accumulated with `+=` (not assignment) to correctly handle paths where a `Value` contributes to the loss through multiple subsequent operations. This scalar-based autograd system is algorithmically identical to PyTorch's `backward()` but operates on individual scalars instead of tensors.
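A minimal sketch of such a scalar `Value` class, showing only addition and multiplication (the real implementation also covers power, log, exp, and ReLU); this is an illustration of the mechanism, not microgpt's exact code:

```python
# Minimal scalar autograd in the spirit described above (illustrative sketch).
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # input Values of the producing op
        self._local_grads = local_grads  # d(output)/d(input) for each child

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # build reverse topological order over the computation graph
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0                  # d(loss)/d(loss) = 1
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += v.grad * lg  # chain rule, accumulated with +=

a, b = Value(2.0), Value(3.0)
loss = a * b + a                         # d(loss)/da = b + 1 = 4, d(loss)/db = a = 2
loss.backward()
```

After `backward()`, `a.grad` is 4.0: `a` feeds the loss through two paths (via the product and directly), and the `+=` accumulation sums both contributions.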
- Model Architecture:
- MicroGPT implements a simplified GPT-2 architecture. It is a stateless function that takes tokens, position information, parameters, and cached keys/values to output logits for the next token.
- Hyperparameters: the embedding dimension, number of attention heads, number of layers, and maximum sequence length are all kept small; in total the model has 4,192 parameters.
- Helper Functions:
- `linear(x, weight, bias)`: Performs matrix-vector multiplication, the fundamental learned linear transformation.
- `softmax(logits)`: Converts raw logits into a probability distribution, ensuring values lie in [0, 1] and sum to 1. It uses a numerically stable implementation that subtracts the maximum logit before exponentiation.
- `rmsnorm(x, weight)`: Re-normalizes a vector to unit root-mean-square (RMS) amplitude, preventing activations from growing or shrinking excessively and thus stabilizing training.
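These helpers can be sketched in pure Python on lists of floats. The signatures mirror the description above; the exact implementations (e.g., the epsilon inside `rmsnorm`) are assumptions, not microgpt's source:

```python
import math

def linear(x, weight, bias=None):
    # matrix-vector product: weight is a list of rows, x a vector
    out = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out

def softmax(logits):
    # numerically stable: subtract the max logit before exponentiating
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rmsnorm(x, weight, eps=1e-5):
    # rescale x to unit root-mean-square amplitude, then apply a learned gain
    rms = math.sqrt(sum(xi * xi for xi in x) / len(x) + eps)
    return [w * xi / rms for w, xi in zip(weight, x)]
```

Subtracting the max logit leaves the softmax output unchanged mathematically but keeps `math.exp` from overflowing on large logits.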
- Model Structure:
- Embeddings: Token IDs are mapped to token embeddings and position IDs to position embeddings. These two vectors are summed element-wise to encode both "what" the token is and "where" it sits in the sequence.
- Attention Block: The core communication mechanism. Each token is projected into a Query (Q), Key (K), and Value (V) vector. Self-attention computes attention scores as the dot product of the current token's Q with all preceding tokens' K vectors, scaled by 1/sqrt(head dimension). These scores are softmaxed to obtain attention weights, and the output is the weighted sum of the preceding tokens' V vectors. This process runs in parallel across multiple attention heads, whose concatenated outputs are linearly projected. A crucial aspect is the KV cache, which stores keys and values from previous tokens during both training (unusual) and inference, allowing the model to process one token at a time autoregressively.
- MLP Block: A two-layer feed-forward network. It expands the embedding dimension (typically 4x), applies a ReLU activation, and then projects back to the original dimension. This block performs the "computation" or "thinking" local to each token's position.
- Residual Connections: Both the attention and MLP blocks utilize residual connections, adding their output to their input. This helps gradients flow directly through the network, enabling the training of deeper models.
- Output: The final hidden state is projected to the vocabulary size (27 logits) by the `lm_head` layer. A higher logit indicates a higher probability that the corresponding token comes next.
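The attention-with-KV-cache mechanism described above can be sketched for a single head, one token at a time. This is an illustrative reconstruction on plain float lists; the function name `attend` and the 2-dimensional toy vectors are hypothetical:

```python
import math

def attend(q, k_cache, v_cache):
    """Single-head attention of query q over all cached (key, value) pairs."""
    d = len(q)
    # attention scores: dot(q, k) / sqrt(d) against every cached key
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in k_cache]
    # softmax over the preceding positions (numerically stable)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # output: weighted sum of the cached value vectors
    return [sum(w * v[i] for w, v in zip(weights, v_cache)) for i in range(d)]

# autoregressive loop: cache this token's k and v, then attend over the cache
k_cache, v_cache = [], []
for q, k, v in [([1.0, 0.0], [1.0, 0.0], [2.0, 0.0]),
                ([0.0, 1.0], [0.0, 1.0], [0.0, 2.0])]:
    k_cache.append(k)
    v_cache.append(v)
    out = attend(q, k_cache, v_cache)
```

Because each step appends to the cache before attending, position t sees exactly positions 1..t, which is the causal masking a decoder-only Transformer needs.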
- Training Loop:
- The loop iteratively performs: (1) document selection, (2) forward pass, (3) loss calculation, (4) backpropagation, (5) parameter update.
- Tokenization: For each step, a name is selected and wrapped with BOS tokens (e.g., "emma" becomes `[BOS, e, m, m, a, BOS]`). The model's objective is to predict each subsequent token given all previous tokens.
- Forward Pass and Loss: Tokens are fed one by one. At each position, the model outputs 27 logits. The loss at each position is the negative log-probability of the correct next token (cross-entropy loss), i.e. -log p(target). The total loss for a document is the average of these per-position losses.
- Backward Pass: `loss.backward()` is called once, computing gradients for all parameters.
- Adam Optimizer: Instead of simple Stochastic Gradient Descent, Adam is used. It maintains two moving averages per parameter, `m` (first moment/momentum) and `v` (second moment/squared gradient), along with bias-corrected versions (`m_hat`, `v_hat`). Each parameter is updated as `p.data -= lr * m_hat / (sqrt(v_hat) + eps)`, where `lr` is the learning rate; the learning rate decays linearly during training. After the update, `p.grad` is reset to 0.
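One Adam step for a single scalar parameter can be sketched as follows. The constants (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are the conventional Adam defaults and are assumed here; microgpt's exact settings may differ:

```python
def adam_step(p, state, lr, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter p with fields .data and .grad."""
    state['m'] = beta1 * state['m'] + (1 - beta1) * p.grad       # first moment
    state['v'] = beta2 * state['v'] + (1 - beta2) * p.grad ** 2  # second moment
    m_hat = state['m'] / (1 - beta1 ** t)                        # bias correction
    v_hat = state['v'] / (1 - beta2 ** t)
    p.data -= lr * m_hat / (v_hat ** 0.5 + eps)
    p.grad = 0.0                                                 # reset after update

class Param:  # minimal stand-in for a Value parameter
    def __init__(self, data, grad):
        self.data, self.grad = data, grad

p = Param(data=1.0, grad=0.5)
state = {'m': 0.0, 'v': 0.0}
adam_step(p, state, lr=0.01, t=1)
```

The bias correction matters most at early steps: at `t=1` the raw moving averages are scaled far below the true gradient statistics, and dividing by `1 - beta**t` compensates.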
- Inference (Sampling):
- After training, new names can be generated by sampling.
- Sampling starts with a BOS token. The model predicts logits for the next token, which are converted to probabilities; a token is then randomly sampled from this distribution (optionally controlled by `temperature`).
- The sampled token is fed back as the next input, and the process repeats until a BOS token is generated (indicating end of sequence) or the maximum sequence length is reached.
- Temperature: A hyperparameter that scales the logits before the softmax. A temperature of 1.0 samples directly from the learned distribution; lower temperatures (e.g., 0.5) sharpen the distribution, leading to more conservative, high-probability choices, while higher temperatures flatten it, producing more diverse but potentially less coherent output.
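Temperature-controlled sampling can be sketched as follows; the function name `sample` and its return shape are illustrative assumptions:

```python
import math, random

def sample(logits, temperature=1.0):
    """Draw one token index from the softmax of temperature-scaled logits."""
    scaled = [l / temperature for l in logits]   # <1 sharpens, >1 flattens
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # invert the CDF with a uniform random draw
    r, cdf = random.random(), 0.0
    for i, p in enumerate(probs):
        cdf += p
        if r < cdf:
            return i, probs
    return len(probs) - 1, probs
```

Dividing the logits by a temperature below 1.0 widens the gaps between them, so after the softmax the most likely token gets an even larger share of the probability mass.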
Results and Significance:
MicroGPT trains rapidly (about one minute on a MacBook) and reduces the loss from ~3.3 (the level of uniform random guessing over 27 tokens, since -ln(1/27) ≈ 3.3) to ~2.37, indicating successful learning of statistical patterns within the name dataset. It can then "hallucinate" plausible new names like "kamon" or "karai."
Comparison to Production LLMs:
While MicroGPT embodies the complete algorithmic essence of GPT, production LLMs like ChatGPT differ significantly in scale and engineering:
- Data: Trillions of tokens from internet text vs. 32,000 names. Involves sophisticated data cleaning, filtering, and mixing.
- Tokenization: Subword tokenizers (e.g., BPE) with ~100K token vocabularies vs. character-level.
- Autograd Engine: Tensor-based autograd on GPUs/TPUs processing billions of floating-point operations per second vs. scalar Python implementation.
- Architecture: Billions/trillions of parameters vs. 4,192 parameters. Production models are much wider and deeper, incorporating advanced modules (e.g., RoPE, GQA, MoE). However, the fundamental "attention (communication) and MLP (computation)" structure remains.
- Training Optimization: Large-scale batching, gradient accumulation, mixed-precision training, and sophisticated learning rate schedules (warmup, decay) on thousands of GPUs vs. basic Adam with linear decay.
- Post-training: Production LLMs undergo Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to become conversational agents, transforming a document completion model into a chatbot.
- Inference Serving: Complex engineering stacks for efficient serving to millions of users, including request batching, KV cache management, speculative decoding, and quantization for memory reduction.
MicroGPT clarifies that the "magic" of LLMs lies not in exotic new algorithms but in the scale-up of this fundamental 200-line mechanism, combined with meticulous engineering and extensive data. It demonstrates that models don't "understand" in a human sense but rather learn statistical regularities to predict the next most probable token.