
BigCodeBench Leaderboard
Key Points
- BigCodeBench is a leaderboard that evaluates LLMs on practical and challenging programming tasks, using both a smaller, difficult "Hard Set" and a larger "Full Set" of benchmarks, with models ranked by Pass@1.
- It features "Complete" mode for code completion from structured docstrings and "Instruct" mode for generating code from brief natural language, designed to test a model's coding ability versus its understanding of human intent.
- The leaderboard also provides details on evaluation setups, warns about data contamination, and uses symbols to indicate the openness of model weights and data, alongside recommending other benchmarks for a comprehensive assessment.
BigCodeBench is an evaluation framework designed to assess the capabilities of Large Language Models (LLMs) in solving practical and challenging programming tasks. The leaderboard ranks models primarily based on their (calibrated) Pass@1 score, achieved using greedy decoding.
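Pass@1 with greedy decoding amounts to the fraction of tasks solved by a model's single deterministic sample. A minimal sketch of how such a score could be computed, using the standard unbiased pass@k estimator (the per-task pass flags below are invented for illustration):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n samples per task, c of which
    pass the tests, the expected probability that at least one of k
    drawn samples passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With greedy decoding there is one deterministic sample per task
# (n = 1), so pass@1 reduces to the fraction of tasks solved.
results = [True, False, True, True]  # hypothetical per-task pass flags
pass1 = sum(pass_at_k(1, int(ok), 1) for ok in results) / len(results)
print(pass1)  # 0.75
```

For sampled (non-greedy) evaluation, the same estimator generalizes: with `n = 10` samples of which `c = 3` pass, `pass_at_k(10, 3, 1)` gives 0.3.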
The evaluation suite is categorized into two primary sets of tasks:
- Hard Set: A focused subset comprising approximately 150 tasks, specifically selected for being more user-facing and inherently challenging.
- Full Set: The comprehensive collection of 1140 tasks available within the BigCodeBench benchmark.
BigCodeBench employs two distinct variants for evaluating code generation, each targeting different aspects of an LLM's proficiency:
- Complete (Code Completion): In this variant, models are tasked with completing code given a structured, long-context docstring. This setup directly evaluates the model's core coding ability and its understanding of detailed, contextual programming requirements.
- Instruct (Code Generation from Instructions): This variant assesses the model's capacity to understand and translate brief, natural language (NL-oriented) instructions into functional code. It serves as a "vibe check" on the model's real-world applicability by testing its capability to interpret human intent.
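The contrast between the two variants can be sketched with a hypothetical task posed in both styles (the task, names, and wording below are illustrative, not taken from the actual benchmark):

```python
import re

# "Complete" variant: the model sees a structured, long-context
# docstring and must fill in the function body.
complete_prompt = '''import re

def task_func(text):
    """Extract all email addresses from the given text.

    Parameters:
    - text (str): Input string that may contain email addresses.

    Returns:
    - list: All substrings matching a basic email pattern, in order
      of appearance.

    Example:
    >>> task_func("contact: a@b.com")
    ['a@b.com']
    """
'''

# "Instruct" variant: the same task stated as a brief NL instruction.
instruct_prompt = ("Write a Python function task_func(text) that "
                   "returns a list of all email addresses in text.")

# Both prompts target the same ground-truth behaviour, e.g.:
def reference_solution(text):
    return re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

print(reference_solution("contact: a@b.com"))  # ['a@b.com']
```

The Complete prompt pins down the interface and expected behaviour in detail, so it mostly probes coding ability; the Instruct prompt leaves those details implicit, so it also probes how well the model infers the user's intent.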
Specific evaluation setups are indicated by distinct symbols:
- 🧠: Denotes an evaluation where response prefilling during generation is omitted, potentially encouraging a more explicit reasoning process within the model.
- ✨: Identifies models evaluated within a chat setting, contrasting with others that are assessed via direct code completion.
The leaderboard emphasizes critical considerations regarding model transparency and data integrity:
- "Size" refers to the number of model parameters utilized during inference.
- Model providers bear the responsibility for preventing data contamination, particularly for models trained on closely related datasets.
- Models are categorized by their open-source status: 💚 signifies open weights and open data, while 💙 indicates open weights and open Supervised Fine-Tuning (SFT) data, but with a non-data-open base model. This distinction is made to provide clarity on the potential for reasoning about data contamination.
Beyond BigCodeBench, the document recommends a diverse array of other leaderboards and benchmarks for a comprehensive understanding of LLM coding ability, including SWE Arena, EvalPlus Leaderboard, Spider 2.0, Chatbot Arena Leaderboard, CrossCodeEval, ClassEval, CRUXEval, Code Lingua, Evo-Eval, HumanEval.jl (Julia version), HumanEval with EvalPlus test cases, InfiCoder-Eval, LiveCodeBench, NaturalCodeBench, RepoBench, SWE-bench, and TabbyML Leaderboard OOP. Acknowledgements are extended to the EvalPlus team for the leaderboard template and to the BigCode community for their significant contributions.