microgpt - GPT Training and Inference Implemented in 200 Lines of Pure Python | GeekNews


xguru
2026.02.20
· News · by 배레온 / Busan / Developer
#Autograd #GPT #LLM #Python #Transformer

Key Points

  • microgpt is a 200-line pure Python implementation of a complete GPT model, created by Karpathy as an "art project" to distill the core algorithmic essence of large language models to their irreducible minimum.
  • This self-contained project includes all fundamental components: a character-level tokenizer, a custom scalar autograd engine (the Value class), a small GPT-2-like Transformer architecture, and an Adam optimizer. It can train and run inference on a dataset of 32,000 names.
  • While demonstrating the full algorithmic loop from data ingestion to new-name generation, microgpt highlights that the fundamental mathematics are identical to massive production LLMs, which differ mainly in scale, engineering optimizations (e.g., tensor-based autograd, larger models, BPE tokenizers, advanced optimization), and post-training processes like SFT and RLHF.

MicroGPT is a pure Python implementation of the core GPT algorithm, condensed into approximately 200 lines of code without external dependencies. Developed by Andrej Karpathy, it serves as an educational art project to demystify Large Language Models (LLMs) by presenting their "irreducible minimum" algorithmic essence. It builds upon previous projects like micrograd, makemore, and nanogpt, consolidating their concepts into a single file that includes dataset handling, tokenization, an autograd engine, a GPT-2-like Transformer architecture, an Adam optimizer, and full training and inference loops.

Core Methodology and Technical Details:

  1. Dataset and Tokenization:
    • Dataset: MicroGPT uses a simple dataset of 32,000 names, with each name treated as a "document." The model learns statistical patterns within these names to generate new, plausible names.
    • Tokenization: A character-level tokenizer is employed. Each unique lowercase letter (a-z) is assigned an integer ID, along with a special Beginning of Sequence (BOS) token. A name like "emma" is tokenized as [BOS, e, m, m, a, BOS]. The total vocabulary size is 27 (26 letters + 1 BOS).
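The character-level scheme above can be sketched in a few lines of plain Python. This is a hypothetical illustration: names like stoi, itos, encode, and decode follow the conventions of Karpathy's earlier projects but are not guaranteed to match microgpt's actual identifiers.

```python
chars = "abcdefghijklmnopqrstuvwxyz"
BOS = 0                                           # special Beginning-of-Sequence token
stoi = {ch: i + 1 for i, ch in enumerate(chars)}  # 'a' -> 1, ..., 'z' -> 26
itos = {i: ch for ch, i in stoi.items()}

def encode(name):
    # wrap each document (name) in BOS tokens: "emma" -> [BOS, e, m, m, a, BOS]
    return [BOS] + [stoi[c] for c in name] + [BOS]

def decode(tokens):
    # drop BOS tokens and map the rest back to characters
    return "".join(itos[t] for t in tokens if t != BOS)

vocab_size = len(chars) + 1                       # 27 = 26 letters + BOS
```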
  2. Autograd (Automatic Differentiation):
    • The core of training is computing gradients, implemented via a custom Value class.
    • Each Value object encapsulates a scalar floating-point number (.data) and maintains a record of the operation that produced it (_op), its input Value objects (_children), and the partial derivatives of that operation with respect to its inputs (_local_grads).
    • Supported operations include addition, multiplication, exponentiation, logarithm, exponential, and ReLU.
    • The backward() method is called on the final scalar loss Value object. It initializes the gradient of the loss to 1 (self.grad = 1) and then traverses the computational graph in reverse topological order. At each Value node, it applies the chain rule, multiplying the incoming gradient (from its parent in the backward pass) by its _local_grads to compute gradients for its _children. Gradients are accumulated using += (not plain assignment) to correctly handle paths where a Value contributes to the loss through multiple subsequent operations. This scalar-based autograd system is algorithmically identical to PyTorch's backward() but operates on individual scalars instead of tensors.
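A stripped-down Value class in the spirit of the description above might look like the following. This is a sketch, not microgpt's exact implementation; only a few operations are shown, and the (_children, _local_grads) representation mirrors the fields named in the text.

```python
import math

class Value:
    """Scalar autograd node: stores data, grad, and how it was produced."""
    def __init__(self, data, _children=(), _local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = _children          # input Values of the producing op
        self._local_grads = _local_grads    # d(output)/d(input) per child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def log(self):
        return Value(math.log(self.data), (self,), (1.0 / self.data,))

    def backward(self):
        # build reverse topological order of the computational graph
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        # chain rule, accumulating with += so shared nodes sum contributions
        self.grad = 1.0
        for v in reversed(topo):
            for child, local in zip(v._children, v._local_grads):
                child.grad += local * v.grad
```

For example, for y = x*x + x at x = 2, backward() accumulates x.grad = 2x + 1 = 5 across the three paths x contributes through.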
  3. Model Architecture:
    • MicroGPT implements a simplified GPT-2 architecture. It is a stateless function that takes tokens, position information, parameters, and cached keys/values to output logits for the next token.
    • Hyperparameters: n_embd = 16 (embedding dimension), n_head = 4 (attention heads), n_layer = 1 (number of layers), block_size = 16 (maximum sequence length). The model has 4,192 parameters.
    • Helper Functions:
      • linear(x, weight, bias): Performs matrix-vector multiplication, fundamental for learned linear transformations.
      • softmax(logits): Converts raw logits into a probability distribution, ensuring values lie in [0, 1] and sum to 1. It uses a numerically stable implementation by subtracting the maximum logit before exponentiation.
      • rmsnorm(x, weight): Re-normalizes a vector to have unit root mean square (RMS) amplitude, preventing activations from growing or shrinking excessively, thus stabilizing training.
    • Model Structure:
      • Embeddings: Token IDs are mapped to token embeddings (W_te) and position IDs to position embeddings (W_pe). These two vectors are summed element-wise to encode both "what" the token is and "where" it is in the sequence.
      • Attention Block: The core communication mechanism. Each token is projected into a Query (Q), Key (K), and Value (V) vector. Self-attention calculates attention scores by taking the dot product of the current token's Q with all preceding tokens' K vectors (scaled by 1/√d_head). These scores are then softmaxed to obtain attention weights. The output is a weighted sum of the preceding tokens' V vectors. This process is parallelized across multiple attention heads, and their concatenated outputs are linearly projected. A crucial aspect is the KV Cache, which stores keys and values from previous tokens during both training (unusual) and inference, allowing the model to process one token at a time autoregressively.
      • MLP Block: A two-layer feed-forward network. It expands the embedding dimension (typically 4x), applies a ReLU activation, and then projects back to the original dimension. This block performs the "computation" or "thinking" local to each token's position.
      • Residual Connections: Both the attention and MLP blocks utilize residual connections, adding their output to their input. This helps gradients flow directly through the network, enabling the training of deeper models.
      • Output: The final hidden state is projected to the vocabulary size (27 logits) by the lm_head layer. Higher logits indicate a higher probability for that token to be the next in sequence.
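One step of the single-head attention mechanism with a KV cache, as described above, can be sketched with plain floats. This is an illustrative simplification (no Q/K/V projections, no multi-head split); the function name and signature are hypothetical.

```python
import math

def attend(q, k, v, k_cache, v_cache):
    """Process one token: append its key/value to the cache, then attend
    over all cached positions (causal masking holds by construction,
    since the cache only ever contains past and current tokens)."""
    k_cache.append(k)
    v_cache.append(v)
    d_head = len(q)
    # attention scores: dot(q, k_t) / sqrt(d_head) for each cached position
    scores = [sum(qi * ki for qi, ki in zip(q, kt)) / math.sqrt(d_head)
              for kt in k_cache]
    # numerically stable softmax over the scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # output: attention-weighted sum of the cached value vectors
    return [sum(w * vt[i] for w, vt in zip(weights, v_cache))
            for i in range(d_head)]
```

On the very first token the cache holds a single entry, so the softmax weight is 1.0 and the output is just that token's own value vector.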
  4. Training Loop:
    • The loop iteratively performs: (1) document selection, (2) forward pass, (3) loss calculation, (4) backpropagation, (5) parameter update.
    • Tokenization: For each step, a name is selected and wrapped with BOS tokens (e.g., "emma" becomes [BOS, e, m, m, a, BOS]). The model's objective is to predict each subsequent token given all previous tokens.
    • Forward Pass and Loss: Tokens are fed one by one. At each position, the model outputs 27 logits. The loss for each position is the negative log-probability of the correct next token (cross-entropy loss), calculated as -log p(target token). The total loss for a document is the average of these per-position losses.
    • Backward Pass: loss.backward() is called once, computing gradients for all parameters.
    • Adam Optimizer: Instead of simple Stochastic Gradient Descent, Adam is used. It maintains two moving averages per parameter: m (first moment/momentum) and v (second moment/squared gradient), along with bias-corrected versions (m_hat, v_hat). Parameters are updated as θ ← θ − α · m_hat / (√v_hat + ε), where α is the learning rate. The learning rate decays linearly during training. After each update, p.grad is reset to 0.
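The Adam update just described can be sketched over a flat list of parameters. The hyperparameter values below are common defaults, not necessarily microgpt's, and the function name is hypothetical.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3,
              beta1=0.9, beta2=0.95, eps=1e-8):
    """One Adam update; t is the 1-based step count (for bias correction)."""
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = beta1 * m[i] + (1 - beta1) * g        # first moment (momentum)
        v[i] = beta2 * v[i] + (1 - beta2) * g * g    # second moment
        m_hat = m[i] / (1 - beta1 ** t)              # bias-corrected moments
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return params
```

At t = 1 the bias corrections cancel the (1 − β) factors exactly, so the first step moves each parameter by roughly lr in the direction opposite its gradient.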
  5. Inference (Sampling):
    • After training, new names can be generated by sampling.
    • Sampling starts with a BOS token. The model predicts logits for the next token, which are converted to probabilities. A token is then randomly sampled from this distribution (optionally controlled by temperature).
    • The sampled token is fed back as the next input, and the process repeats until a BOS token is generated (indicating end of sequence) or the maximum sequence length is reached.
    • Temperature: A hyperparameter applied before softmax. A temperature of 1.0 samples directly from the learned distribution. Lower temperatures (e.g., 0.5) sharpen the distribution, leading to more conservative and probable choices. Higher temperatures flatten the distribution, resulting in more diverse but potentially less coherent outputs.
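The temperature-controlled sampling step described above can be sketched as follows; this is a plain-Python illustration, and the function name is hypothetical rather than microgpt's actual identifier.

```python
import math
import random

def sample_token(logits, temperature=1.0):
    """Sample a token id from logits, with temperature applied before softmax."""
    # temperature < 1 sharpens the distribution; > 1 flattens it
    scaled = [l / temperature for l in logits]
    # numerically stable softmax
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # inverse-CDF sampling from the categorical distribution
    r = random.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

In an autoregressive loop, the returned token id is fed back as the next input until a BOS token is sampled or block_size is reached.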

Results and Significance:

MicroGPT trains rapidly (approx. 1 minute on a MacBook) and demonstrates a reduction in loss from ~3.3 (the random-guess baseline, since ln 27 ≈ 3.3) to ~2.37, indicating successful learning of statistical patterns within the name dataset. It can "hallucinate" plausible new names like "kamon" or "karai."

Comparison to Production LLMs:

While MicroGPT embodies the complete algorithmic essence of GPT, production LLMs like ChatGPT differ significantly in scale and engineering:

  • Data: Trillions of tokens from internet text vs. 32,000 names. Involves sophisticated data cleaning, filtering, and mixing.
  • Tokenization: Subword tokenizers (e.g., BPE) with ~100K token vocabularies vs. character-level.
  • Autograd Engine: Tensor-based autograd on GPUs/TPUs processing billions of floating-point operations per second vs. scalar Python implementation.
  • Architecture: Billions/trillions of parameters vs. 4,192 parameters. Production models are much wider and deeper, incorporating advanced modules (e.g., RoPE, GQA, MoE). However, the fundamental "attention (communication) and MLP (computation)" structure remains.
  • Training Optimization: Large-scale batching, gradient accumulation, mixed-precision training, and sophisticated learning rate schedules (warmup, decay) on thousands of GPUs vs. basic Adam with linear decay.
  • Post-training: Production LLMs undergo Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to become conversational agents, transforming a document completion model into a chatbot.
  • Inference Serving: Complex engineering stacks for efficient serving to millions of users, including request batching, KV cache management, speculative decoding, and quantization for memory reduction.

MicroGPT clarifies that the "magic" of LLMs lies not in exotic new algorithms but in the scale-up of this fundamental 200-line mechanism, combined with meticulous engineering and extensive data. It demonstrates that models don't "understand" in a human sense but rather learn statistical regularities to predict the next most probable token.