LLM Model Leaderboard
Compare GPQA, SWE-bench, and AIME 2025 reasoning scores alongside weekly token volume for the leading foundation models.
Full Rankings Table
| Rank | Model | Provider | GPQA | SWE-bench | AIME 2025 | Weekly Volume (tokens) |
|---|---|---|---|---|---|---|
| 1 | DeepSeek V4 Flash | DeepSeek | 88.1% | 79.0% | N/A | 4.83T |
| 2 | MiMo-V2.5 | Xiaomi | 66.7% | 78.9% | N/A | 4.49T |
| 3 | MiniMax M3 | MiniMax | N/A | 80.5% | N/A | 3.83T |
| 4 | Owl Alpha | OpenRouter | N/A | N/A | N/A | 3.29T |
| 5 | Hy3 preview | Tencent | N/A | N/A | N/A | 3.28T |
| 6 | Claude Opus 4.7 | Anthropic | 94.2% | 87.6% | N/A | 2.34T |
| 7 | DeepSeek V4 Pro | DeepSeek | 90.1% | 80.6% | N/A | 2.07T |
| 8 | Claude Opus 4.8 | Anthropic | 93.6% | 88.6% | N/A | 2.06T |
| 9 | GLM 5.2 | Zhipu AI | 91.2% | N/A | N/A | 1.92T |
| 10 | Claude Sonnet 4.6 | Anthropic | 89.9% | 79.6% | N/A | 1.49T |
| 11 | Claude 3.7 Sonnet | Anthropic | 84.8% | 70.3% | 54.8% | N/A |
| 12 | DeepSeek-R1 | DeepSeek | 82.4% | 73.1% | 93.1% | N/A |
| 13 | GPT-5.5 | OpenAI | 93.6% | N/A | 81.2% | N/A |
| 14 | Claude Fable 5 | Anthropic | N/A | 95.0% | N/A | N/A |
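
The weekly volume column uses suffixed values such as 4.83T. The sketch below shows one way to turn those strings into numbers and compute each model's share of the listed volume. It assumes that "T" means trillion tokens, which the page does not state explicitly, and the helper names `parse_volume` and `volume_shares` are illustrative only.

```python
# Multipliers for suffixed volumes; "T" = trillion is an assumption about the table's notation.
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}

# Weekly volumes copied from the rankings table (ranks 1-10, plus one N/A row).
WEEKLY_VOLUME = {
    "DeepSeek V4 Flash": "4.83T",
    "MiMo-V2.5": "4.49T",
    "MiniMax M3": "3.83T",
    "Owl Alpha": "3.29T",
    "Hy3 preview": "3.28T",
    "Claude Opus 4.7": "2.34T",
    "DeepSeek V4 Pro": "2.07T",
    "Claude Opus 4.8": "2.06T",
    "GLM 5.2": "1.92T",
    "Claude Sonnet 4.6": "1.49T",
    "DeepSeek-R1": "N/A",
}


def parse_volume(text):
    """Convert a string such as '4.83T' to a token count, or None when no volume is listed."""
    text = text.strip()
    if text in {"N/A", ""}:
        return None
    suffix = text[-1].upper()
    if suffix in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[suffix]
    return float(text)


def volume_shares(volumes):
    """Return each model's fraction of the total volume, skipping rows without a volume."""
    parsed = {name: parse_volume(value) for name, value in volumes.items()}
    known = {name: value for name, value in parsed.items() if value is not None}
    total = sum(known.values())
    return {name: value / total for name, value in known.items()}


shares = volume_shares(WEEKLY_VOLUME)
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:20s} {share:6.1%}")
```

Shares computed this way describe only the rows listed in the table, not total traffic across all models.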
Theoretical Trends in Foundation Models
The Rise of Test-Time Compute: The industry is shifting from pure pre-training scaling to test-time compute scaling. Models like DeepSeek-R1 spend more compute during inference to generate, evaluate, and refine intermediate reasoning, which improves their results on math and logic problems without increasing parameter count.
Benchmark Saturation and Academic Focus: As models achieve near-perfect scores on simple coding and chat benchmarks, evaluation suites are moving toward high-difficulty, multi-step academic tests (like GPQA) to measure progress in scientific and engineering logic.
Open-Weights Democratization: High-quality open-weights models are challenging the dominance of proprietary APIs. By hosting models on optimized hardware or using managed router layers, enterprises can often get comparable reasoning capability at a fraction of the cost.
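
To illustrate the test-time compute trend described above, the sketch below uses majority voting over several sampled answers (often called self-consistency). This is one simple way to spend extra inference compute and is not a description of how DeepSeek-R1 or any other listed model works internally. The function `noisy_solver` is a hypothetical stand-in for one sampled model response, and its 60% per-sample accuracy is an arbitrary assumption.

```python
import random
from collections import Counter

CORRECT_ANSWER = 42  # hypothetical ground truth for the toy problem


def noisy_solver(rng, p_correct=0.6):
    """Hypothetical stand-in for one sampled reasoning path: right with probability p_correct."""
    if rng.random() < p_correct:
        return CORRECT_ANSWER
    return rng.choice([41, 43, 44])  # a few plausible wrong answers


def majority_vote(answers):
    """Return the most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def accuracy(num_samples, trials=2000, seed=0):
    """Estimate how often majority voting over num_samples samples gives the right answer."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        answers = [noisy_solver(rng) for _ in range(num_samples)]
        hits += majority_vote(answers) == CORRECT_ANSWER
    return hits / trials


single = accuracy(1)   # one sample per question
voted = accuracy(15)   # fifteen samples per question, same model
print(f"1 sample: {single:.2f}, 15 samples with voting: {voted:.2f}")
```

The simulated model has the same parameters in both runs. Only the number of samples per question changes, which is the sense in which accuracy scales with inference compute here.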
Academic Reasoning & Hard-Benchmark Methodologies
Evaluating foundation models requires metrics that resist memorization of web-crawled text. Older benchmarks such as GSM8K and HumanEval have suffered from data contamination, which inflates scores. To measure actual reasoning capability, we track three harder benchmarks: GPQA, SWE-bench Verified, and AIME 2025.
GPQA (Graduate-Level Google-Proof Q&A) consists of multiple-choice biology, physics, and chemistry questions written by PhD-level domain experts and designed so that the answers cannot be found with a quick web search. This forces the model to synthesize concepts rather than retrieve verbatim text.
SWE-bench Verified evaluates whether a model can autonomously resolve real GitHub issues: the model writes a patch, and the patch is run against the repository's human-validated unit tests.
AIME 2025 (American Invitational Mathematics Examination) presents competition math problems that require careful formulation, multi-step planning, and exact calculation, and they remain demanding for advanced reasoning models.
- GPQA Expert Validation: Each question is reviewed by other experts in the same field and also attempted by skilled non-experts with unrestricted web access, to check that it is objective, highly difficult, and resistant to internet search strategies.
- SWE-bench Verified Standards: A curated subset of SWE-bench where humans have manually validated the unit tests and problem statements to ensure they are fair and solvable.
- AIME Competition Problems: Algebra, geometry, number theory, and combinatorics problems whose answers are integers from 0 to 999, so models are scored on reaching the exact numerical answer through multi-step reasoning rather than on written proofs.
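
The SWE-bench scoring rule can be sketched as follows. An issue counts as resolved when the tests the reference fix was meant to repair (fail-to-pass) now pass and the tests that already passed (pass-to-pass) still pass. The `PatchResult` record, the issue IDs, and the test names below are hypothetical and exist only for illustration. This is not the official evaluation harness.

```python
from dataclasses import dataclass


@dataclass
class PatchResult:
    """Hypothetical record of test outcomes after applying a model-written patch."""
    issue_id: str
    fail_to_pass: dict  # test name -> True if a previously failing test now passes
    pass_to_pass: dict  # test name -> True if a previously passing test still passes


def is_resolved(result):
    """Resolved only if every targeted test passes and nothing that worked before broke."""
    if not result.fail_to_pass:
        return False  # no targeted tests means there is nothing to confirm the fix
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolved_rate(results):
    """Fraction of issues resolved, which is the percentage reported on leaderboards."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


results = [
    PatchResult("demo-1", {"test_fix": True}, {"test_old_a": True, "test_old_b": True}),
    PatchResult("demo-2", {"test_fix": True}, {"test_old_a": False}),  # fix causes a regression
    PatchResult("demo-3", {"test_fix": False}, {"test_old_a": True}),  # fix does not work
    PatchResult("demo-4", {"test_fix_1": True, "test_fix_2": True}, {}),
]
rate = resolved_rate(results)
print(f"Resolved: {rate:.1%}")
```

Requiring the pass-to-pass tests to stay green is what keeps a patch from scoring by breaking unrelated behavior.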
Weekly Token Telemetry & Platform Scaling Laws
Throughput and token telemetry provide insight into the practical economic adoption of foundation models. The Kaplan and Chinchilla scaling laws relate model loss to parameter count, training data, and compute: larger models trained on more data reach lower perplexity, but they need larger compute budgets and are slower and costlier to serve. Telemetry from OpenRouter tracks the total prompt and completion tokens routed to each model. This acts as a real-world proxy for developer preference on that platform: a high token count suggests that developers find a model's balance of cost, speed, and capability a good fit for commercial apps. Reasoning models like DeepSeek-R1 generate extensive chain-of-thought tokens before returning the final answer, which increases the completion token count in exchange for better accuracy on hard problems. This telemetry helps analyze the trade-offs developers make between raw speed and reasoning quality.
- Token Telemetry Analysis: Tracking weekly volume helps developers identify shifts in model adoption, such as the sudden growth of cost-efficient flash models or high-end reasoning systems.
- Scaling Law Tradeoffs: Balancing parameter size against operational latency to select the most cost-effective model for production API routing.
- Chain-of-Thought Overhead: Managing the additional completion tokens generated by reasoning models, which delay the first visible part of the answer but improve reasoning depth.
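
The chain-of-thought overhead can be estimated with a rough latency model. The sketch below assumes tokens are decoded one after another at a constant rate and that all reasoning tokens are produced before the first answer token. The decode speed, the time to first token, and the token counts are hypothetical values, not measurements of any listed model.

```python
def response_times(reasoning_tokens, answer_tokens, tokens_per_second, time_to_first_token=0.5):
    """Estimate seconds until the answer starts and until it finishes.

    Simplifying assumptions: a constant decode rate, and reasoning tokens
    generated strictly before any answer token.
    """
    wait_for_answer = time_to_first_token + reasoning_tokens / tokens_per_second
    total = wait_for_answer + answer_tokens / tokens_per_second
    return {
        "wait_for_answer_s": wait_for_answer,
        "total_s": total,
        "completion_tokens": reasoning_tokens + answer_tokens,
    }


# Hypothetical comparison at 100 tokens per second.
direct = response_times(reasoning_tokens=0, answer_tokens=200, tokens_per_second=100)
reasoning = response_times(reasoning_tokens=2000, answer_tokens=200, tokens_per_second=100)

print("direct:   ", direct)
print("reasoning:", reasoning)
```

Under these assumptions the reasoning run emits 11 times as many completion tokens, and the user waits about 20 seconds longer before the answer begins.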
Frequently Asked Questions
What is the difference between pre-training compute and test-time compute?
Pre-training compute is spent during the initial training of the model to learn language representations and facts from web-scale data. Test-time compute is spent during inference, allowing the model to generate multiple reasoning paths, perform internal verification, and refine its response before presenting it to the user.
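
A rough sense of scale for the two kinds of compute comes from two widely used approximations for dense transformers: about 6 FLOPs per parameter per training token, and about 2 FLOPs per parameter per generated token. The sketch below applies them to a hypothetical 7-billion-parameter model trained on 140 billion tokens. Those sizes, and the token counts per response, are assumptions chosen only to make the arithmetic concrete.

```python
def pretraining_flops(n_params, n_training_tokens):
    """Approximate training compute: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_training_tokens


def inference_flops(n_params, n_generated_tokens):
    """Approximate generation compute: about 2 FLOPs per parameter per generated token."""
    return 2 * n_params * n_generated_tokens


N_PARAMS = 7e9           # hypothetical model size
TRAIN_TOKENS = 1.4e11    # hypothetical training set, 20 tokens per parameter

train = pretraining_flops(N_PARAMS, TRAIN_TOKENS)
short_answer = inference_flops(N_PARAMS, 500)       # direct answer
long_reasoning = inference_flops(N_PARAMS, 10_000)  # answer preceded by a long reasoning trace

print(f"pre-training:        {train:.2e} FLOPs (paid once)")
print(f"short answer:        {short_answer:.2e} FLOPs per request")
print(f"with long reasoning: {long_reasoning:.2e} FLOPs per request")
print(f"reasoning multiplier: {long_reasoning / short_answer:.0f}x")
```

Pre-training is a one-time cost, while test-time compute is paid on every request and grows with the number of tokens the model generates.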
Why is GPQA considered resistant to data contamination?
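GPQA questions are written from scratch by subject-matter experts, and the dataset is distributed with access controls (such as a password-protected archive) along with a request that the questions not be reposted in plain text. This makes it less likely that web crawlers scrape the questions into pre-training corpora, which limits how much a model can simply memorize the answers. These measures reduce contamination risk rather than eliminating it.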
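
One common way to check whether a benchmark question has leaked into a text corpus is to look for long word sequences shared by both. The sketch below is a minimal version of that n-gram overlap check. It is not GPQA's own procedure, and the whitespace tokenization, the window size of 8 words, and the sample question are assumptions made for illustration.

```python
def ngrams(text, n):
    """Return the set of n-word sequences in text, lowercased and split on whitespace."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question, corpus_text, n=8):
    """Fraction of the question's n-grams that also appear in the corpus text."""
    question_ngrams = ngrams(question, n)
    if not question_ngrams:
        return 0.0  # question shorter than n words, so nothing to compare
    shared = question_ngrams & ngrams(corpus_text, n)
    return len(shared) / len(question_ngrams)


# Hypothetical question, not taken from any benchmark.
question = (
    "Which quantum number determines the orbital angular momentum "
    "of an electron in a hydrogen atom"
)
leaked_page = "forum post: " + question + " answer is below"
clean_page = "a blog post about growing tomatoes in a small garden during a short summer season"

leaked_score = overlap_fraction(question, leaked_page)
clean_score = overlap_fraction(question, clean_page)
print(f"leaked page overlap: {leaked_score:.2f}, clean page overlap: {clean_score:.2f}")
```

A high overlap flags a question for manual review. Paraphrased or translated leaks would not be caught by exact n-gram matching.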