LIVE · 2026.03.08 · v2.1 · 70 Models · 9 Tabs · 24 Open

ALL Bench Leaderboard 2026

The only leaderboard covering LLM · VLM · Agent · Image · Video · Music in one place. 42 LLMs + 11 VLMs + 28 generative models. All scores cross-verified.

🔥 v2.1 — Confidence System + Intelligence Report · 🌙 Dark mode · 📱 Mobile ready · 🇰🇷 K-EXAONE data from official Technical Report
🤗 HuggingFace Dataset ⚡ GitHub Repo 🧬 FINAL Bench Dataset 🏆 FINAL Bench Leaderboard
📊 Leaderboard
👁 VLM
🤖 Agent
🖼 Image
🎬 Video
🎵 Music
🔍 Tools
📄 Report
📈 Charts
📎 Info
🏆 ALL Bench Composite Score Ranking
√Coverage Score = Avg × √(N/10) · 10 benchmarks (v1.5: LCB replaces SWE-V) · ✓Full(7+) ◐Partial(4-6) ○Limited(<4) · Colored by provider
Data Confidence: ✓✓ Cross-verified (2+ sources) · ✓ Single source · ~ Self-reported · Hover score badges for source details · Verified: 2026-03-08
Model Provider 🏆 Score 📅 Release 📚 MMLU-Pro 🧠 GPQA◆ 📐 AIME25 🔭 HLE 🧩 ARC-AGI-2★ 🧬 Metacog★ 🏗 SWE-Pro 🔧 BFCL 📋 IFEval 🖥 LCB 🖥 TB2.0★ 🔬 SciCode★ 💻 SWE-V⚠ 🌍 MMMLU 📥 CtxIn 📤 CtxOut ⚡ tok/s ⏱ TTFT 👁 Vision ⚙ Arch 🏆 ELO 📄 License 💰 $/M in
Grade: S ≥90% · A ≥75% · B ≥60% · C <60%
★ = New in v1.0 · 💚 Green row = Open-source value pick · 🧩 ARC-AGI-2 = arcprize.org official · 🧬 Metacog = FINAL-Bench official (8 of 9 tested models in bench) · 🖥 TB2.0 = tbench.ai official · 🔬 SciCode = AA independent · ⚙ Score = Avg × √(N/10): Full (7+) · Partial (4–6) · Limited (<4) · v1.5: LCB replaces SWE-V
👁 VISION LANGUAGE v2.1 Flagship + Open-Source SOTA · 15 Models × 10 Key Benchmarks + Detailed Comparison

NEW v2.1: Flagship VLM comparison across 10 multimodal models. Sources: Vals.ai, Google DeepMind, OpenAI official, Anthropic, InternVL3 paper, Qwen official. Confidence: ✓✓ Cross-verified · ✓ Single source · ~ Self-reported

🏆 Flagship VLM Comparison · Cross-provider multimodal intelligence ranking
Model MMMU MMMU-Pro MathVista AI2D OCRBench MMStar Hallusion MMBenchEN RealWorldQA VideoMME
⚡ Lightweight / Edge Model Detail · Qwen official 34 benchmarks · 5 models
STEM & PUZZLE: MMMU · MMMU-Pro · MathVision · MathVista · We-Math · DynaMath · ZEROBench · ZEROBench-sub · VlmsAreBlind · BabyVision
GENERAL VQA & DOCUMENT: RealWorldQA · MMStar · MMBenchEN · SimpleVQA · HallusionBench · OmniDocBench · CharXiv · MMLongDoc · CC-OCR · AI2D · OCRBench
SPATIAL · VIDEO · AGENT · MEDICAL: ERQA · CountBench · EmbSpatial · RefSpatial · LingoQA · VidMME+ · VidMME · VidMMMU · MLVU · MMVU · ScreenSpot Pro · OSWorld · AndroidWorld · TIR-Bench · SLAKE · PMC-VQA · MedXpertQA
Flagship VLM:
Gemini 3.1 Pro
Gemini 3 Flash
GPT-5.2
Claude Opus 4.6
Grok 4 Heavy
InternVL3.5-241B
InternVL3-78B
Qwen2.5-VL-72B
Kimi-VL-A3B
GPT-5 (original)

Edge Models:
GPT-5-Nano
Gemini-2.5-FL-Lite
Qwen3-VL-30B-A3B
Qwen3.5-9B
Qwen3.5-4B
Source: Qwen official benchmarks · BabyVision & TIR-Bench show "with CI / without CI"
🤖 AGENT BENCH v2.1 Agentic Capability Comparison — Desktop, Web, Terminal, Tool Use

Sources: Anthropic, OpenAI, Google DeepMind official announcements + Onyx AI, Vellum, NxCode, DataCamp independent reviews.
⚠ Note: Agent scores vary significantly by scaffolding (agent framework). Values shown are best reported across implementations. "—" = not published / not applicable.

Model 🖥 OSWorld 🔧 τ²-bench 🌐 BrowseComp 🖥 TB 2.0 📋 GDPval 🏗 SWE-Pro 🔧 BFCL v4 📱 Android
Agent Benchmarks: OSWorld = desktop GUI · τ²-bench = multi-turn tools · BrowseComp = web research · TB2.0 = terminal · GDPval = professional work · SWE-Pro = SEAL coding · BFCL = function calling
🖼 IMAGE GENERATION v2.1 10 Models — Qualitative & Arena Ranking Comparison

Image generation lacks the standardized numeric benchmarks that LLMs have. Rankings combine LM Arena Elo, expert reviews (Cliprise, Vellum, Awesome Agents), and community consensus.
Dimensions: Photorealism · Artistic Quality · Text Rendering · Prompt Adherence · Speed · Cost. Ratings: ⬛S (top tier) · 🟦A (strong) · 🟧B (capable) · ⬜C (limited).

Model Provider Release 🏆 Arena 📷 Photo 🎨 Art 📝 Text 🎯 Prompt ⚡ Speed 💰 Cost License
Ratings: ⬛S = Top tier 🟦A = Strong 🟧B = Capable ⬜C = Limited Sources: LM Arena, Cliprise, Vellum, Awesome Agents, community consensus (Feb 2026)
🎬 VIDEO GENERATION v2.1 10 Models — Quality · Motion · Audio · Duration · Resolution · Cost

Sources: LaoZhang AI, Pinggy, RizzGen, CrePal, TeamDay, Awesome Agents (Feb 2026). All models rated on S/A/B/C scale.
2026 breakthroughs: Native audio generation (Veo 3.1, Sora 2, Kling 3.0) · Multi-shot sequences (Kling 3.0) · 4K output (LTX-2) · Open-source parity (Wan 2.6)

Model Provider Release 📷 Quality 🎬 Motion 🔊 Audio 🎯 Prompt ⏱ Max Dur 📐 Max Res 💰 Cost License
Key: Quality = visual fidelity · Motion = physics/consistency · Audio = native sound gen · Prompt = adherence to description · Duration = max single generation · Open = open-source weights available
🎵 MUSIC / AUDIO GEN v2.1 8 Models — Vocal · Instrumental · Lyrics · Duration · Style Range

⚠ No standardized benchmarks exist for music generation. Rankings based on expert reviews, community consensus, and platform capabilities.
Dimensions: Vocal Quality · Instrumental · Lyrics Understanding · Max Duration · Style Range · Commercial Rights

Model Provider Release 🎤 Vocal 🎸 Instru 📝 Lyrics 🎨 Styles ⏱ Max Dur 💰 Cost License
Note: Music AI is the least benchmarked domain. Ratings reflect community consensus + expert reviews. Commercial rights vary — check each provider's terms before publishing.
🔍 INTERACTIVE TOOLS v2.1 Find · Compare · Verify · Visualize — 67 models across all modalities
🔍 Model Finder
⚔ Head-to-Head
📊 Trust Map
🏁 Bar Race
Find your optimal model: Filter by price, capability, and type. Each dot = one model. X = Price · Y = Composite Score. Hover for details. The best value models are in the top-left zone (high score, low cost).
📄 INTELLIGENCE REPORT

🏆 Executive Summary

🥇 Category Winners

📊 Top 10 LLM Ranking

# Model Score Coverage Type Price

💡 Key Insights

Data Confidence: ✓✓ Cross-verified (2+ independent sources) · ✓ Single source (provider official) · ~ Self-reported / unverified · — No data available
Last verified: 2026-03-08 · Methodology: 5-Axis Intelligence Framework (Knowledge · Expert Reasoning · Abstract · Metacognition · Execution)

🧩 ARC-AGI-2 — Abstract Reasoning Frontier

Official arcprize.org · Vertical bars by score · Contamination-proof visual reasoning benchmark

Key: Gemini 3.1 Pro leads at 77.1% (verified arcprize.org). Claude Opus 4.6 68.8% · GPT-5.2 52.9% · Kimi K2.5 12.1%. Each model shows distinct reasoning profile — ARC-AGI-2 is the most contamination-proof benchmark.

🧬 Metacog: Baseline → Self-Correction Gain (Δ)

FINAL-Bench official · Baseline FINAL Score vs MetaCog condition · Error Recovery drives 94.8% of gains

Key: Claude Opus 4.6 has lowest baseline (rank 9) but largest Δ gain (+20.13) — strongest self-correction. Kimi K2.5 highest baseline but smallest gain. Declarative–Procedural gap persists across all models.

🕸 Capability Radar — TOP 6 Multi-Axis Profile

MMLU-Pro · GPQA · AIME · HLE · ARC-AGI-2 · MMMLU · Each axis normalized to 100

Key: No single model dominates all axes. Gemini leads MMMLU+HLE, GPT-5.2 leads MMLU-Pro, Kimi K2.5 exceptional on MMLU-Pro 92.0. Different strengths suggest routing strategies.

📊 Capability Domains — Reasoning vs Coding vs Language

Grouped bars: Reasoning avg (GPQA+AIME+HLE) · Coding avg (SWE-Pro+LCB) · Language avg (MMLU-Pro+MMMLU+IFEval)

Key: Claude Opus 4.6 leads Coding domain. Gemini 3.1 Pro leads Language. GPT-5.2 most balanced across all three domains — ideal for general-purpose deployment.
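For reproducibility, a minimal sketch of the domain averaging behind these grouped bars — the benchmark grouping follows the description above, while the input scores in the example are illustrative placeholders, not leaderboard values:

```python
DOMAINS = {
    "Reasoning": ["GPQA", "AIME25", "HLE"],
    "Coding": ["SWE-Pro", "LCB"],
    "Language": ["MMLU-Pro", "MMMLU", "IFEval"],
}

def domain_averages(scores: dict[str, float]) -> dict[str, float]:
    """Simple mean per domain; benchmarks missing from `scores` are skipped."""
    out = {}
    for domain, benches in DOMAINS.items():
        vals = [scores[b] for b in benches if b in scores]
        out[domain] = round(sum(vals) / len(vals), 1) if vals else float("nan")
    return out

# Illustrative input only — not actual leaderboard data
print(domain_averages({"GPQA": 85, "AIME25": 92, "HLE": 35,
                       "SWE-Pro": 50, "LCB": 80,
                       "MMLU-Pro": 86, "MMMLU": 84, "IFEval": 90}))
# {'Reasoning': 70.7, 'Coding': 65.0, 'Language': 86.7}
```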

💰 Performance vs Cost — Value Frontier Map

X = Input price log scale ($/M tokens) · Y = Composite Score · Top-left quadrant = elite value zone

Value leaders: DeepSeek V3.2 ($0.14/M, score ~74) and GLM-5 ($0.35/M) offer exceptional open-weight value. GPT-OSS-120B is truly free with competitive performance.

🏭 Provider Strength — Average Score by Company

Average composite score across all models per provider · Shows lab-level consistency

Key: OpenAI strongest average (combining closed + OSS models). Alibaba's Qwen3.5 family shows remarkable breadth. DeepSeek punches above its weight with MIT-licensed models.

📅 Intelligence Timeline — Score vs Release Date

Bubble size = context window (log scale) · Color = provider · Rapid capability gains 2025→2026

Key: ~15-point score jump from Jan 2025 to Feb 2026. Feb 2026 releases (GPT-5.2, Gemini 3.1 Pro) establish new ceiling. Context window growth independent of intelligence score.

⚖ Open vs Closed — Distribution Comparison

Score distribution: Open-weight (18 models) vs Closed-API (6 models) · Box plot style with individual points

Key: Open-weight models now overlap significantly with closed-API. Top open models (Kimi K2.5, Qwen3.5-397B) match or exceed many closed offerings — open-source gap is closing.

📐 Benchmark Score Variance — Consistency Analysis

For each benchmark: show min/max/mean across all models · Reveals benchmark difficulty & discrimination power

Key: HLE shows widest variance (7.0–44.9) = best discrimination. ARC-AGI-2 also highly discriminating (12.1–88.1). AIME25 scores cluster high — many models saturating it.

🌡 Full Benchmark Heatmap — 39 Models × 11 Benchmarks

Color intensity = score · White/light = unreported · Indigo = high · Reveals capability patterns across the entire landscape

🧩 ARC-AGI-2 ★NEW — Abstract Reasoning

Tests novel visual pattern completion — cannot be solved by memorization. arcprize.org. Gemini 3.1 Pro 77.1% (verified) · Claude Opus 4.6 68.8% · GPT-5.4 73.3% · GPT-5.2 52.9% · Kimi K2.5 12.1%. Most contamination-proof benchmark available.

🇰🇷 Korean Sovereign AI — National Foundation Model Project

Ministry of Science and ICT "National AI Foundation Model Project" as of 2026.02 — 4 elite teams: LG AI Research (K-EXAONE) · SK Telecom (A.X K1) · Upstage (Solar Open 100B) · Motif Technologies. Plus KT (Mi:dm 2.0) as independent Korea-centric AI.
• 1st evaluation (2026.01.15): 5 teams → 3 teams (Naver Cloud & NC AI eliminated)
• Wildcard round (2026.02.20): Motif Technologies added → 4-team structure
• K-EXAONE: 1st place in evaluation · 72-point avg across 13 benchmarks · AA open-weight top 10 · 236B MoE
• Solar Open 100B: AIME 84.3% · 19.7T tokens · 100B MoE · arXiv 2601.07022
• A.X K1: Korea's first 500B parameter model · Apache 2.0 open-source
• Goal: Achieve 95%+ of global AI model performance · Final 2 teams selected by 2027 · KRW 530B budget

🧬 Metacognitive ★NEW — FINAL-Bench

Official: HF FINAL-Bench/Metacognitive. 100 tasks, 9 SOTA models tested. Baseline FINAL Score: Kimi K2.5 68.71 · GPT-5.2 62.76 · GLM-5 62.50 · MiniMax-M2.5 60.54 · GPT-OSS-120B 60.42 · DeepSeek-V3.2 60.04 · GLM-4.7P 59.54 · Gemini 59.5 · Opus 4.6 56.04. ER (error recovery) accounts for 94.8% of self-correction gains. 8 of 9 tested models now in ALL Bench.

📊 Composite Score — √Coverage Weighted (v1.5)

5-Axis Intelligence Framework:
Knowledge (MMLU-Pro) — 57K questions, highest statistical reliability
Expert Reasoning (GPQA + AIME + HLE) — PhD-level science + math olympiad + frontier-hard
Abstract Reasoning (ARC-AGI-2) — contamination-proof visual pattern recognition
Metacognition (FINAL Bench) — self-correction & error recovery
Execution (SWE-Pro + BFCL + IFEval + LCB) — real coding + tool use + instruction following + competitive programming

Formula: Score = Avg(confirmed) × √(N/10)
• N = confirmed benchmarks out of 10 · √ softens penalty: 10/10=×1.00 · 7/10=×0.84 · 4/10=×0.63
✓ Full (7+) · ◐ Partial (4-6) · ○ Limited (<4)
v1.5 change: SWE-Verified removed from composite (59.4% tasks defective per OpenAI audit). Replaced with LiveCodeBench — continuously updated, contamination-resistant.
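A minimal sketch of the composite calculation exactly as defined above — the per-benchmark scores in the example are illustrative placeholders, not leaderboard data:

```python
import math

def composite_score(scores: dict[str, float | None], total: int = 10) -> tuple[float, str]:
    """Score = Avg(confirmed) x sqrt(N / total). None = unreported benchmark."""
    confirmed = [v for v in scores.values() if v is not None]
    n = len(confirmed)
    if n == 0:
        return 0.0, "○ Limited"
    avg = sum(confirmed) / n
    score = avg * math.sqrt(n / total)
    coverage = "✓ Full" if n >= 7 else "◐ Partial" if n >= 4 else "○ Limited"
    return round(score, 2), coverage

# Illustrative model with 7 of 10 benchmarks reported -> x0.84 coverage multiplier
example = {"MMLU-Pro": 85.0, "GPQA": 80.0, "AIME25": 90.0, "HLE": 30.0,
           "ARC-AGI-2": 50.0, "SWE-Pro": 45.0, "BFCL": 70.0,
           "IFEval": None, "LCB": None, "Metacog": None}
print(composite_score(example))  # avg 64.29 x 0.837 -> (53.79, '✓ Full')
```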

📚 MMLU-Pro

HF: TIGER-Lab/MMLU-Pro. 57,000 expert-level questions across disciplines. Largest sample size → highest statistical reliability. Much harder than original MMLU. Gold standard general knowledge benchmark.

🧠 GPQA Diamond ⭐

HF: Idavidrein/gpqa. 198 PhD-level questions in biology, chemistry, physics. Human expert average ~65%. Highest discrimination power among frontier models.

📐 AIME 2025

AoPS: 2025 AIME. American Invitational Mathematics Examination. 2025 problem set minimizes contamination. Tests mathematical reasoning and creative problem solving.

🔭 HLE — Humanity's Last Exam

HF: centerforaisafety/hle. 2,500 expert-submitted questions. Intended to be the final closed-ended academic benchmark. Kimi K2.5 44.9% · Gemini 3.1 Pro 44.7% lead.

🏗 SWE-Pro ⭐ Recommended

scale.com/leaderboard/coding. Scale AI SEAL, 1865 real repos. Contamination-free. ~35pt lower than SWE-Verified — honest measure of real coding. OpenAI recommends over Verified.

💻 SWE-Verified ⚠ Caution

swebench.com. 59.4% of tasks found defective in OpenAI audit. Memorization/contamination risk. Reference only. Prefer SWE-Pro for accurate assessment.

🔧 BFCL v4

gorilla.cs.berkeley.edu. Berkeley Function-Calling Leaderboard. Measures tool use and agent capability. Qwen3.5-122B world #1.

📋 IFEval

HF: google/IFEval. Instruction following evaluation. Verifiable output constraints. Tests precision compliance.

🖥 LiveCodeBench

livecodebench.github.io. Competitive programming from LeetCode, AtCoder, Codeforces. Continuously updated to prevent contamination.

🖥 Terminal-Bench 2.0 ★NEW — Agentic Terminal Tasks

tbench.ai. Stanford + Laude Institute. ~80 tasks: compile code, train models, configure servers, play games, debug systems.
• Best agent+model combo scores: Gemini 3.1 Pro 78.4% · GPT-5.3 Codex 77.3% · Claude Opus 4.6 74.7% · Gemini 3 Flash 64.3%
• Tests real-world terminal capability — distinct from SWE-bench (file editing) · Agent framework matters: same model varies 10-20pts by scaffold
Source: tbench.ai official leaderboard (best model score across all agents)

🔬 SciCode ★NEW — Scientific Research Coding

scicode-bench.github.io. 338 sub-problems from 80 real research tasks across 16 scientific disciplines (Chemistry, Physics, Biology, Math).
• AA independent: Gemini 3.1 Pro 58.9% · Gemini 3 Pro 56.1% · GPT-5.2 Codex 54.6%
• Only 3 model scores publicly available from AA — most models show "—" (data insufficient)
Why included: Fills the "science coding" gap — existing benchmarks (SWE-Pro, LCB) test SE/competitive only

🌍 MMMLU — Multilingual

HF: openai/MMMLU. MMLU in 57 languages. Gemini 3.1 Pro ~88% leads. Qwen3.5 officially supports 201 languages.

⚙ Architecture

MoE = sparse activation (efficient), Dense = full params (quality), Hybrid = DeltaNet+MoE. Parentheses = active/total params. Active params determine inference cost. Qwen3.5-35B: 3B active → 194 tok/s.
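As a rough illustration of why active parameters (not total) drive inference cost: batch-1 decoding is typically memory-bandwidth-bound, so a back-of-envelope estimate divides memory bandwidth by the bytes of active weights read per token. The 1 TB/s bandwidth and bf16 (2 bytes/param) figures below are assumptions for illustration only, not measurements of any listed model:

```python
def rough_decode_tok_per_sec(active_params_b: float, bytes_per_param: float = 2.0,
                             mem_bandwidth_gb_s: float = 1000.0) -> float:
    """Back-of-envelope: every decoded token re-reads the active weights (memory-bound)."""
    weight_gb_per_token = active_params_b * bytes_per_param  # GB read per token
    return mem_bandwidth_gb_s / weight_gb_per_token

print(round(rough_decode_tok_per_sec(3)))    # MoE, 3B active, bf16  -> ~167 tok/s ceiling
print(round(rough_decode_tok_per_sec(230)))  # Dense 230B, bf16      -> ~2 tok/s ceiling
```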

⏱ TTFT Latency

Time To First Token (seconds). Lower is faster. Fastest: Mistral Large 3 at 0.3s, GPT-5.2 at 0.6s. Reasoning models are slower due to chain-of-thought (DeepSeek R1: 8s). <2s recommended for real-time apps.
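A minimal client-side sketch for measuring TTFT against any streaming endpoint — `stream_chat` in the usage comment is a hypothetical placeholder for whatever streaming client returns token chunks, not a specific SDK:

```python
import time
from typing import Iterable

def measure_ttft(stream: Iterable[str]) -> tuple[float, float, int]:
    """Return (TTFT seconds, total seconds, chunk count) for a token/chunk stream."""
    start = time.monotonic()
    first = None
    count = 0
    for _chunk in stream:
        if first is None:
            first = time.monotonic() - start  # time to first token
        count += 1
    total = time.monotonic() - start
    return (first if first is not None else float("nan")), total, count

# Usage with a hypothetical streaming client:
#   ttft, total, n = measure_ttft(stream_chat("Hello"))
```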

🔥 GPT-5.4 — OpenAI's Most Capable Model (2026.03.05)

OpenAI: Introducing GPT-5.4. Dense reasoning model, Proprietary, released March 5, 2026.
HLE 52.1% — ALL Bench #1 (dethroning Kimi K2.5 44.9%) · GPT-5.4 Pro reaches 58.7%
• ARC-AGI-2: 73.3% (+20pt from GPT-5.2) · Pro: 83.3% (approaching Gemini 3.1 Pro 88.1%)
• SWE-Pro: 57.7% · GPQA: 92.8% · OSWorld 75.0% (surpasses human 72.4%) — first Computer Use SOTA
• 1M context window · Tool Search (47% token reduction) · Native computer use via Playwright + screenshots
• $2.50/M input, $15/M output · Replaces GPT-5.2 Thinking in ChatGPT

🆕 MiniMax-M2.5 — Agent & Coding Frontier

HF: MiniMaxAI/MiniMax-M2.5. MiniMax (China AI Tiger). 230B MoE (10B active), MIT license, 2026.02.
SWE-Verified 80.2% — ALL Bench #1 for real-world software engineering
• GPQA 84.8 · MMLU-Pro 82.0 · AIME 86.3 · IFEval 87.5 · LCB 82.6 · HLE 19.1
• 1M context window · Forge RL framework · 200K+ real-world training environments
• Emergent architectural thinking: plans project hierarchies before coding

🆕 Step-3.5-Flash — Efficiency Frontier MoE

HF: stepfun-ai/Step-3.5-Flash. StepFun (China AI Tiger). 196B MoE (11B active), Apache 2.0, 2026.02.
AIME 97.3% — near-perfect math reasoning with only 11B active params
• LCB 86.4 · SWE-V 74.4 · Terminal-Bench 51.0 · 256K context · 300 tok/s peak
• MIS-PO (Metropolis Independence Sampling) novel RL method · 3:1 SWA ratio
• Runs locally on Mac Studio M4 Max / NVIDIA DGX Spark · arXiv: 2602.10604

🆕 Nanbeige4.1-3B — 3B Small Model Giant Killer

HF: Nanbeige/Nanbeige4.1-3B. Nanbeige LLM Lab (by Kanzhun/BOSS Zhipin). Built on Nanbeige4-3B-Base, optimized via SFT+RL. Apache 2.0.
At 3B params it outperforms Qwen3-32B across the board: GPQA 83.8 (vs 68.4) · LCB 76.9 (vs 55.7) · Arena-Hard-v2 73.2 (vs 56.0)
• First small general model with Deep Search: 500+ rounds of tool invocation · GAIA 69.9 · xBench 75
• AIME 2026-I 87.4% · BFCL-V4 56.5 · HLE 12.6 · Multi-Challenge 52.2
Reasoning + Alignment + Agentic achieved simultaneously — a new bar for the small-model ecosystem

🆕 Mi:dm 2.0 Base — KT Korea-Centric AI

HF: K-intelligence/Midm-2.0-Base-Instruct. KT (Korea Telecom). 11.5B Dense (Llama + Depth-up Scaling), MIT license, 2025.07.
Korea-centric AI: deeply internalizes Korean social values & commonsense
• Korean Society & Culture avg 78.4% · KMMLU 57.3 · Ko-IFEval 82.0 · Ko-MTBench 89.7
• Outperforms Exaone-3.5-7.8B & Qwen3-14B on Korean evaluation suites
• Function calling support via vLLM · 2.3B Mini variant available for on-device

🆕 Qwen3-Next-80B-A3B — Hybrid Attention Revolution

HF: Qwen/Qwen3-Next-80B-A3B-Thinking. First model in Qwen3-Next series.
Hybrid Attention: Gated DeltaNet + Gated Attention replaces standard attention → efficient ultra-long context
Ultra-Sparse MoE: 80B total, only 3B activated (512 experts, 10 active) → 10x inference throughput
• MMLU-Pro 82.7 · GPQA 77.2 · LCB 68.7 · IFEval 88.9 · MMMLU 81.3 · Multi-Token Prediction (MTP)
• Surpasses Qwen3-30B-A3B-Thinking-2507 & Gemini-2.5-Flash-Thinking · NCML License

👁 Vision Language Tab ★NEW — 34 VL Benchmarks

New tab comparing 5 multimodal models across 34 vision-language benchmarks from Qwen official results.
STEM & Puzzle: MMMU, MMMU-Pro, MathVision, MathVista, We-Math, DynaMath, ZEROBench, VlmsAreBlind, BabyVision
General VQA & Doc: RealWorldQA, MMStar, MMBench, SimpleVQA, HallusionBench, OmniDocBench, CharXiv, CC-OCR, AI2D, OCRBench
Spatial/Video/Agent: ERQA, CountBench, EmbSpatial, LingoQA, VideoMME, MLVU, ScreenSpot Pro, OSWorld, AndroidWorld
Medical: SLAKE, PMC-VQA, MedXpertQA-MM — Qwen3.5-9B leads in nearly all categories

💰 Pricing

Input cost in $/million tokens. 0 = free open-weights. GPT-5-Nano $0.05/M (cheapest frontier). Qwen3.5-35B $0.10/M = Gemini 2.5 FL-Lite $0.10/M. DeepSeek V3.2 $0.14/M. GPT-5.2 $1.75/M · Claude Opus 4.6 $5/M.
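A small worked example of what the $/M-token prices mean per request, using the GPT-5.4 list prices quoted above ($2.50/M in, $15/M out); the 2,000-in / 500-out request size is an arbitrary illustration:

```python
def request_cost(tokens_in: int, tokens_out: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost in USD for one request, given $/million-token prices."""
    return tokens_in / 1e6 * price_in_per_m + tokens_out / 1e6 * price_out_per_m

# 2,000 input + 500 output tokens at GPT-5.4 list prices
print(f"${request_cost(2000, 500, 2.50, 15.00):.4f}")  # $0.0125
```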

📋 Changelog v2.1

v2.1 ✓✓/✓/~ Confidence badges on all benchmark scores with source tooltips. 📄 Intelligence Report tab with Executive Summary, Category Winners, Top 10, Key Insights. PDF/DOCX download. Last verified date tracking (2026-03-08). Source data for 42 models across 12 benchmark columns.
v2.0 All blanks filled: Kimi LCB 85, K-EXAONE MMLU-P 81.8/GPQA 75.4/AIME 85.3, Sonnet 4.6 GPQA 89.9/ARC 60.4, GPT-5.2 LCB 80. Korean AI data from K-EXAONE Technical Report. 42 LLMs cross-verified.
v1.9 +3 LLMs (GPT-5.1, Gemini 3 Pro, Sonnet 4.5). Dark mode. Mobile responsive.
v1.8 Tools tab (Model Finder · Head-to-Head · Trust Map · Bar Race). Header streamlined.
v1.7 Video (10) + Music (8). v1.6 Agent + Image. v1.5 Critical fixes + VLM tab.

✓ Sources & Verification

LLM scores cross-verified against 2+ independent sources: Artificial Analysis Intelligence Index · arcprize.org (ARC-AGI-2 official) · Scale AI SEAL (SWE-Pro) · tbench.ai (Terminal-Bench) · FINAL-Bench/Metacognitive (HF official) · Chatbot Arena · OpenAI/Anthropic/Google official model cards · Vellum · DataCamp · NxCode · digitalapplied. Unverified scores shown as "—" or removed.

ALL Bench Leaderboard v2.1 · 70 AI Models · 📡 API Available · Updated 2026.03.08