GitHub - openai/parameter-golf: Train the smallest LM you can that fits in 16MB. Best model wins!
Key Points
- The OpenAI Model Craft Challenge: Parameter Golf invites participants to train the best language model within a strict 16MB artifact size limit.
- Models must be trained in under 10 minutes on 8x H100s and are evaluated by compression (bits per byte) on the FineWeb validation set.
- The challenge offers $1,000,000 in compute credits and potential hiring opportunities, with leaderboard submissions requiring statistically significant performance improvements.
The OpenAI Model Craft Challenge: Parameter Golf is a competition designed to push the boundaries of language model efficiency under strict resource constraints. The primary objective is to train the best language model that meets two key limitations: fitting within a 16MB artifact size and training in under 10 minutes on 8x NVIDIA H100 GPUs.
The challenge evaluates models on compression performance, specifically bits per byte (bpb) on the FineWeb validation set, measured in a tokenizer-agnostic way. The competition is framed as an L(N) optimization problem: participants aim to achieve the lowest possible loss at a fixed parameter count (N), unconstrained by data, compute steps, or specific architecture. It draws inspiration from similar challenges such as the NanoGPT Speedrun and Slowrun.
Core Constraints and Evaluation:
- Artifact Size: The total submission artifact, comprising code bytes and compressed model bytes, must not exceed 16,000,000 bytes (16MB in decimal units). This calls for aggressive model compression and compact code.
- Training Time: Training must be completed in under 10 minutes on a cluster of 8x H100 GPUs (SXM variant).
- Evaluation Time: Model evaluation on the FineWeb validation set also has a 10-minute limit on 8x H100s.
- Self-Contained Artifact: Submissions must be fully self-contained, with no external downloads, network calls, or access to training datasets during evaluation. Crucially, access to validation data during training is strictly prohibited to prevent overfitting to the evaluation metric.
- Metric: The core evaluation metric is val_bpb (validation bits per byte), which quantifies the compression efficiency of the model's output on unseen data.
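The tokenizer-agnostic framing means the loss is normalized by the raw byte count of the validation text rather than by token count, so different tokenizers compete on equal footing. A minimal sketch of that conversion (the function name `val_bpb` and the toy numbers are illustrative, not from the repo):

```python
import math

def val_bpb(total_nll_nats: float, total_utf8_bytes: int) -> float:
    """Tokenizer-agnostic bits per byte: total negative log-likelihood
    over the validation text (in nats), converted to bits and divided
    by the text's size in raw UTF-8 bytes."""
    return total_nll_nats / (math.log(2) * total_utf8_bytes)

# Toy example: 1,000,000 nats of total loss over 1,500,000 bytes of text.
print(round(val_bpb(1_000_000.0, 1_500_000), 4))
```

Because the denominator is bytes rather than tokens, a model cannot lower its score simply by using a coarser tokenizer; any vocabulary change must genuinely improve compression.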
Encouraged Methodologies and Innovations:
The challenge actively encourages participants to explore novel approaches across various domains to achieve high performance within the stringent constraints:
- Unique Architectures: This includes, but is not limited to, test-time compute, aggressive parameter tying (e.g., tied embeddings), depth recurrence, and low-rank training (e.g., LoRAs for test-time training). The provided baseline is a 9-layer transformer with a model dimension of 512, a 1024-token vocabulary, tied embeddings, and 4 KV heads.
- Compression Schemes: Innovations in this area are critical, such as low-precision training (e.g., int5, int6, int8 quantization), Quantization-Aware Training (QAT), BitNets, and novel tokenizers. Zstandard compression (zstd-22) is also mentioned as a method to reduce the final model size.
- Other Creative Submissions: Test-time training (only on already evaluated tokens), utilization of long context windows (e.g., 4k sequence length), and optimization through custom megakernels are also suggested avenues.
- Training Techniques: Leaderboard entries highlight techniques like Stochastic Weight Averaging (SWA), orthogonal initialization (OrthoInit), Muon WD, SmearGate, spectral embedding initialization, and residual mixing. Techniques like BigramHash are used for efficient parameter usage.
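To make the compression-scheme ideas above concrete, here is a minimal sketch of symmetric per-tensor int8 weight quantization followed by serialization and compression. This is an illustration, not the repo's method; zlib is used as a stdlib stand-in for the zstd-22 compression the challenge mentions:

```python
import struct
import zlib

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: one float32 scale plus
    one signed byte per weight (4x smaller than float32 storage)."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, q

def pack_and_compress(scale, q, level=9):
    """Serialize the scale and quantized values, then compress.
    (zlib here for portability; real entries reportedly use zstd-22.)"""
    raw = struct.pack("f", scale) + struct.pack(f"{len(q)}b", *q)
    return zlib.compress(raw, level)

weights = [0.5, -1.27, 0.01, 0.0, 1.27]
scale, q = quantize_int8(weights)
blob = pack_and_compress(scale, q)
# Toy check of the compressed artifact against the 16 MB budget.
assert len(blob) <= 16_000_000
```

Quantization-Aware Training goes a step further by simulating this rounding during the forward pass so the model learns weights that survive the precision loss, rather than quantizing only after training.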
Submission and Verification:
Submissions are made via Pull Requests and must include a detailed README.md, a submission.json with metadata (including the val_bpb score), and a training log demonstrating a statistically significant improvement (at least 0.005 nats) over the current SOTA. Reproducibility is paramount; non-reproducible results can lead to disqualification. Arbitrary Python packages may be imported, but they must not violate the spirit of the rules regarding compute, code size, or capabilities.
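Since the improvement threshold is stated in nats while val_bpb is reported in bits, a submission check needs a unit conversion. The sketch below illustrates this; the `beats_sota` helper and every submission.json field other than val_bpb are hypothetical, and it assumes the 0.005-nat margin is per byte:

```python
import json
import math

NAT_TO_BIT = 1.0 / math.log(2)  # 1 nat ~ 1.4427 bits

def beats_sota(new_bpb: float, sota_bpb: float, margin_nats: float = 0.005) -> bool:
    """Check the required improvement, converting the nat margin to bits
    so it can be compared against the bpb difference directly."""
    return (sota_bpb - new_bpb) >= margin_nats * NAT_TO_BIT

# Hypothetical metadata; only val_bpb is named by the rules.
submission = {
    "val_bpb": 0.912,              # measured validation bits per byte
    "artifact_bytes": 15_998_000,  # must not exceed 16,000,000
    "train_minutes": 9.4,          # must be under 10 on 8x H100
}
print(json.dumps(submission, indent=2))
assert beats_sota(submission["val_bpb"], sota_bpb=0.920)
```

Under this assumption, a 0.005-nat margin is roughly 0.0072 bpb, so an entry at 0.912 bpb clears a 0.920 SOTA while one at 0.915 would not.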
OpenAI provides $1,000,000 in compute credits to help participants, with a strong recommendation to start with smaller, cheaper GPU instances (e.g., 1x H100) before scaling to the full 8x H100 setup for official submissions.