LMArena Leaderboard | Compare & Benchmark the Best Frontier AI Models

2025.06.08 · Web · by Anonymous
#LLM #AI Models #Benchmarking #Leaderboard #AI

Key Points

  1. This leaderboard provides a snapshot of leading AI models' performance across various categories, including Text, WebDev, Vision, and Text-to-Image.
  2. Detailed rankings with scores and user votes are presented for each arena, alongside an "Arena Overview" comparing models across specific skills like Coding, Math, and Creative Writing.
  3. Top models such as gemini-3-pro, grok-4.1-thinking, and Anthropic claude-opus frequently appear in leading positions, demonstrating strong performance across multiple benchmarks.

This document presents a comprehensive leaderboard of leading artificial intelligence models across various modalities, organized into distinct "Arenas" that each provide a snapshot of performance. The evaluation methodology is not stated explicitly, but the structure implies a competitive ranking system: each model receives a "Score" and accumulates "Votes," likely reflecting user preferences from blind head-to-head comparisons within the Arena framework. The scores are aggregate performance metrics, while the vote counts indicate evaluation volume and therefore bear on the statistical robustness of each rating (a rough confidence-interval sketch follows the arena list below).
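To make the implied mechanism concrete, the following is a minimal sketch of how arena-style scores can emerge from pairwise votes via an Elo-like update. This is illustrative only: the document does not describe LMArena's actual computation (arena-style leaderboards often fit a Bradley-Terry model over all battles rather than applying sequential updates), and the battle log and K-factor below are hypothetical.

```python
# Minimal Elo-like rating sketch over pairwise "battles"; all data hypothetical.
from collections import defaultdict

def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_ratings(battles, k: float = 4.0, base: float = 1000.0) -> dict:
    """battles: iterable of (model_a, model_b, winner), winner in {'a', 'b', 'tie'}."""
    ratings = defaultdict(lambda: base)
    for a, b, winner in battles:
        e_a = expected(ratings[a], ratings[b])
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[a] += k * (s_a - e_a)            # winner gains, scaled by surprise
        ratings[b] += k * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)

# Hypothetical battle log; real arenas aggregate tens of thousands of votes.
battles = [
    ("gemini-3-pro", "grok-4.1-thinking", "a"),
    ("grok-4.1-thinking", "gemini-3-flash", "a"),
    ("gemini-3-pro", "gemini-3-flash", "tie"),
]
print(update_ratings(battles))
```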

The leaderboard is divided into eight primary arenas, each focusing on a specific AI capability:

  1. Text: Evaluates general text generation and understanding.
    • Top models include gemini-3-pro (Score: 1490, Votes: 25,178), grok-4.1-thinking (Score: 1477, Votes: 25,440), and gemini-3-flash (Score: 1471, Votes: 10,204). Notably, "thinking" variants (e.g., grok-4.1-thinking, Anthropic claude-opus-4-5-20251101-thinking-32k) frequently appear at the top, suggesting an emphasis on enhanced reasoning capabilities.
  2. WebDev: Focuses on code generation, web development tasks, and related capabilities.
    • Anthropic claude-opus-4-5-20251101-thinking-32k (Score: 1511, Votes: 5,730) holds the leading position, followed by gpt-5.2-high (Score: 1481, Votes: 1,647) and Anthropic claude-opus-4-5-20251101 (Score: 1479, Votes: 5,445).
  3. Vision: Assesses performance in image understanding and analysis.
    • gemini-3-pro (Score: 1302, Votes: 5,564) is the top performer, with gemini-3-flash (Score: 1274, Votes: 2,630) and gemini-3-flash (thinking-minimal) (Score: 1264, Votes: 949) close behind.
  4. Text-to-Image: Ranks models on their ability to generate images from textual prompts.
    • gpt-image-1.5 (Score: 1243, Votes: 30,369) leads, followed by gemini-3-pro-image-preview-2k (nano-banana-pro) (Score: 1236, Votes: 27,863) and gemini-3-pro-image-preview (nano-banana-pro) (Score: 1232, Votes: 64,138).
  5. Image Edit: Measures the effectiveness of models at image manipulation and editing tasks.
    • chatgpt-image-latest (20251216) (Score: 1417, Votes: 28,444) ranks first, with gemini-3-pro-image-preview-2k (nano-banana-pro) (Score: 1404, Votes: 124,847) and gpt-image-1.5 (Score: 1401, Votes: 173,183) in second and third. Notably, gemini-2.5-flash-image-preview (nano-banana) has accumulated by far the largest vote count (9,654,346).
  6. Search: Evaluates models on information retrieval and grounded response generation.
    • gemini-3-pro-grounding (Score: 1214, Votes: 10,182) leads, followed by gpt-5.2-search (Score: 1211, Votes: 6,168) and gpt-5.1-search (Score: 1203, Votes: 8,059).
  7. Text-to-Video: Ranks models on their ability to generate video content from text descriptions.
    • veo-3.1-fast-audio (Score: 1375, Votes: 7,351) is the top model, with veo-3.1-audio (Score: 1372, Votes: 7,251) and veo-3-fast-audio (Score: 1360, Votes: 24,043) also performing strongly.
  8. Image-to-Video: Assesses video generation from static images.
    • veo-3.1-audio (Score: 1401, Votes: 16,058) takes first place, followed by veo-3.1-fast-audio (Score: 1391, Votes: 15,875) and wan2.5-i2v-preview (Score: 1346, Votes: 8,847).
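The vote counts above span four orders of magnitude, from under a thousand (gemini-3-flash (thinking-minimal) in Vision) to millions (nano-banana in Image Edit), which matters for how much to trust small score gaps. The back-of-envelope sketch below estimates a confidence interval for a rating from its vote count; the binomial approximation and the Elo-style conversion factor are illustrative assumptions, not LMArena's published interval methodology.

```python
# Back-of-envelope sketch relating vote counts to rating reliability.
# Near a 50% win rate, the standard error of an observed win rate is
# sqrt(0.25 / votes), and one unit of win probability corresponds to
# roughly 400 / (ln 10 * 0.25) ≈ 695 Elo-style points. Both steps are
# simplifying assumptions for illustration.
import math

ELO_PER_PROB = 400 / (math.log(10) * 0.25)  # slope of the Elo curve at p = 0.5

def rating_ci_halfwidth(votes: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width of a rating, in Elo-style points."""
    p_halfwidth = z * math.sqrt(0.25 / votes)  # binomial CI for a win rate near 0.5
    return ELO_PER_PROB * p_halfwidth

# Vote counts taken from the leaderboard above.
for model, votes in [("gemini-3-flash (thinking-minimal), Vision", 949),
                     ("gpt-image-1.5, Image Edit", 173_183)]:
    print(f"{model}: about ±{rating_ci_halfwidth(votes):.0f} points")
# -> roughly ±22 points at 949 votes vs. ±2 points at 173,183 votes
```

Under these assumptions, a 10-point gap between models with under a thousand votes each is well within noise, while the same gap at six-figure vote counts is meaningful.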

The document further provides a detailed "Arena Overview" for a subset of 293 models, focused primarily on text-based capabilities. This granular evaluation breaks model performance down into sub-categories, each with its own rank: "Overall," "Expert," "Hard Prompts," "Coding," "Math," "Creative Writing," "Instruction Following," and "Longer Query." This permits a multi-faceted assessment of strengths and weaknesses: while gemini-3-pro ranks 1st Overall, Anthropic claude-opus-4-5-20251101-thinking-32k holds the top position in "Expert," "Hard Prompts," "Instruction Following," and "Longer Query," demonstrating specialized proficiency in these demanding areas. The prominence of "thinking" variants (e.g., grok-4.1-thinking, Anthropic claude-opus-4-5-20251101-thinking-32k), which leverage explicit reasoning steps, suggests the evaluations increasingly reward extended reasoning. The leaderboard includes models from a diverse set of developers, reflecting the broad landscape of current AI research and development. A brief sketch of how such per-category ranks can be derived follows.
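As an illustration of the multi-rank structure, the sketch below derives per-category ranks from per-category scores. The category and model names appear in the document, but the scores here are hypothetical placeholders, not the leaderboard's actual figures.

```python
# Minimal sketch of deriving "Arena Overview"-style per-category ranks
# from per-category scores. Names are from the document; the scores
# themselves are hypothetical placeholders.
def ranks(scores: dict, category: str) -> list[tuple[int, str]]:
    """Rank models by descending score in one category (missing scores rank last)."""
    ordered = sorted(scores, key=lambda m: scores[m].get(category, float("-inf")),
                     reverse=True)
    return [(i + 1, m) for i, m in enumerate(ordered)]

scores = {  # model -> {category: score}; illustrative numbers only
    "gemini-3-pro":                          {"Overall": 1490, "Expert": 1478, "Coding": 1466},
    "claude-opus-4-5-20251101-thinking-32k": {"Overall": 1485, "Expert": 1492, "Coding": 1470},
    "grok-4.1-thinking":                     {"Overall": 1477, "Expert": 1471, "Coding": 1459},
}

for category in ("Overall", "Expert", "Coding"):
    print(category, ranks(scores, category))
# Different categories can yield different leaders, mirroring the Arena Overview,
# where gemini-3-pro leads "Overall" but the claude-opus thinking variant leads "Expert".
```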