LMArena Leaderboard | Compare & Benchmark the Best Frontier AI Models
Key Points
- This leaderboard provides a snapshot of leading AI models' performance across various categories, including Text, WebDev, Vision, and Text-to-Image.
- Detailed rankings with scores and user votes are presented for each arena, alongside an "Arena Overview" comparing models across specific skills such as Coding, Math, and Creative Writing.
- Top models such as gemini-3-pro, grok-4.1-thinking, and Anthropic's claude-opus frequently appear in leading positions, demonstrating strong performance across multiple benchmarks.
This document presents a comprehensive leaderboard of leading artificial intelligence models across various modalities, categorized into distinct "Arenas" to provide a snapshot of their performance. The evaluation methodology, implied by the leaderboard's structure, is a competitive ranking system in which models receive a "Score" and accumulate "Votes," likely reflecting user preferences from blind pairwise comparisons within the "Arena" framework. The scores are aggregate performance metrics, and the vote counts indicate evaluation volume, offering insight into statistical robustness.
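As a rough illustration of how pairwise votes could produce scores in the range shown here, the sketch below applies standard Elo updates to a list of battle outcomes. The K factor, base rating, 400-point scale, and the battle data are all assumptions chosen for illustration; this document does not disclose the leaderboard's actual rating computation.

```python
from collections import defaultdict

# Hypothetical sketch: sequential Elo updates from pairwise votes.
# K, SCALE, and BASE_RATING are conventional Elo choices, not values
# confirmed by the leaderboard.
K = 32          # update step size
SCALE = 400.0   # logistic scale for the expected-score curve
BASE_RATING = 1000.0

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / SCALE))

def update_ratings(battles):
    """One sequential pass over (model_a, model_b, winner) tuples,
    where winner is "a", "b", or "tie"."""
    ratings = defaultdict(lambda: BASE_RATING)
    for model_a, model_b, winner in battles:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] += K * (s_a - e_a)
        ratings[model_b] += K * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)

# Placeholder battle data for demonstration only.
battles = [
    ("gemini-3-pro", "grok-4.1-thinking", "a"),
    ("grok-4.1-thinking", "gemini-3-flash", "a"),
    ("gemini-3-pro", "gemini-3-flash", "tie"),
]
print(update_ratings(battles))
```

In practice, arena-style leaderboards typically fit ratings over the full vote set at once (e.g., with a Bradley-Terry model) rather than sequentially, which makes the result independent of battle order; the sequential version above is simply the easiest to read.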
The leaderboard is divided into eight primary arenas, each focusing on a specific AI capability:
- Text: Evaluates general text generation and understanding.
Top models include gemini-3-pro (Score: 1490, Votes: 25,178), grok-4.1-thinking (Score: 1477, Votes: 25,440), and gemini-3-flash (Score: 1471, Votes: 10,204). Notably, "thinking" variants (e.g., grok-4.1-thinking, Anthropic claude-opus-4-5-20251101-thinking-32k) frequently appear at the top, suggesting an emphasis on enhanced reasoning capabilities.
- WebDev: Focuses on code generation, web development tasks, and related capabilities.
Anthropic claude-opus-4-5-20251101-thinking-32k (Score: 1511, Votes: 5,730) holds the leading position, followed by gpt-5.2-high (Score: 1481, Votes: 1,647) and Anthropic claude-opus-4-5-20251101 (Score: 1479, Votes: 5,445).
- Vision: Assesses performance in image understanding and analysis.
gemini-3-pro (Score: 1302, Votes: 5,564) is the top performer, with gemini-3-flash (Score: 1274, Votes: 2,630) and gemini-3-flash (thinking-minimal) (Score: 1264, Votes: 949) closely following.
- Text-to-Image: Ranks models based on their ability to generate images from textual prompts.
gpt-image-1.5 (Score: 1243, Votes: 30,369) leads, followed by gemini-3-pro-image-preview-2k (nano-banana-pro) (Score: 1236, Votes: 27,863) and gemini-3-pro-image-preview (nano-banana-pro) (Score: 1232, Votes: 64,138).
- Image Edit: Measures the effectiveness of models in image manipulation and editing tasks.
chatgpt-image-latest (20251216) (Score: 1417, Votes: 28,444) is ranked first, with gemini-3-pro-image-preview-2k (nano-banana-pro) (Score: 1404, Votes: 124,847) and gpt-image-1.5 (Score: 1401, Votes: 173,183) in subsequent positions. Notably, gemini-2.5-flash-image-preview (nano-banana) garnered a significant number of votes (9,654,346).
- Search: Evaluates models for information retrieval and grounded response generation.
gemini-3-pro-grounding (Score: 1214, Votes: 10,182) leads, followed by gpt-5.2-search (Score: 1211, Votes: 6,168) and gpt-5.1-search (Score: 1203, Votes: 8,059).
- Text-to-Video: Ranks models on their capability to generate video content from text descriptions.
veo-3.1-fast-audio (Score: 1375, Votes: 7,351) is the top model, with veo-3.1-audio (Score: 1372, Votes: 7,251) and veo-3-fast-audio (Score: 1360, Votes: 24,043) also performing strongly.
- Image-to-Video: Assesses the generation of video from static images.
veo-3.1-audio (Score: 1401, Votes: 16,058) ranks first, followed by veo-3.1-fast-audio (Score: 1391, Votes: 15,875) and wan2.5-i2v-preview (Score: 1346, Votes: 8,847).
The document further provides a detailed "Arena Overview" for a subset of 293 models, primarily focusing on text-based capabilities. This granular evaluation breaks performance down into specific sub-categories, each with its own ranking: "Overall," "Expert," "Hard Prompts," "Coding," "Math," "Creative Writing," "Instruction Following," and "Longer Query." This allows a multi-faceted assessment of model strengths and weaknesses (a minimal sketch of the structure follows below). For instance, while gemini-3-pro ranks 1st Overall, Anthropic claude-opus-4-5-20251101-thinking-32k holds the top position in "Expert," "Hard Prompts," "Instruction Following," and "Longer Query," demonstrating specialized proficiency in these demanding areas. The prominence of "thinking" variants (e.g., grok-4.1-thinking, Anthropic claude-opus-4-5-20251101-thinking-32k), which leverage explicit reasoning steps, suggests that deliberate reasoning confers a measurable advantage in these evaluations. The leaderboard includes models from a diverse set of developers, reflecting the broad landscape of current AI research and development.
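As a minimal sketch of the "Arena Overview" structure described above, the snippet below models per-category ranks as a nested mapping and extracts the sub-categories each model leads. Only the rank-1 placements explicitly stated in the text are real; every other rank value is a placeholder for illustration.

```python
# Hypothetical per-model ranks across Arena Overview sub-categories.
# Rank-1 entries match the text; remaining values are placeholders.
overview: dict[str, dict[str, int]] = {
    "gemini-3-pro": {
        "Overall": 1, "Expert": 2, "Hard Prompts": 2, "Coding": 2,
    },
    "claude-opus-4-5-20251101-thinking-32k": {
        "Overall": 2, "Expert": 1, "Hard Prompts": 1,
        "Instruction Following": 1, "Longer Query": 1,
    },
}

def categories_led(model: str) -> list[str]:
    """Return the sub-categories in which a model holds rank 1."""
    return [cat for cat, rank in overview[model].items() if rank == 1]

for model in overview:
    print(f"{model}: leads {categories_led(model)}")
```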