Top Foundation Models

LLM Model Leaderboard

Compare GPQA, SWE-bench, and AIME reasoning scores alongside weekly token volumes for the leading foundation models.

  • #1 (Champion): DeepSeek V4 Flash, DeepSeek, 4.83T weekly tokens
  • #2 (Silver): MiMo-V2.5, Xiaomi, 4.49T weekly tokens
  • #3 (Bronze): MiniMax M3, MiniMax, 3.83T weekly tokens

Full Rankings Table

Rank  Model Name         Provider    Scores (GPQA, SWE-bench, AIME 2025)  Weekly Vol  Action
1     DeepSeek V4 Flash  DeepSeek    88.1%, 79.0%                         4.83T       Details
2     MiMo-V2.5          Xiaomi      66.7%, 78.9%                         4.49T       Website
3     MiniMax M3         MiniMax     80.5%                                3.83T       Website
4     Owl Alpha          OpenRouter  none listed                          3.29T       Website
5     Hy3 preview        Tencent     none listed                          3.28T       Website
6     Claude Opus 4.7    Anthropic   94.2%, 87.6%                         2.34T       Details
7     DeepSeek V4 Pro    DeepSeek    90.1%, 80.6%                         2.07T       Details
8     Claude Opus 4.8    Anthropic   93.6%, 88.6%                         2.06T       Details
9     GLM 5.2            Zhipu AI    91.2%                                1.92T       Website
10    Claude Sonnet 4.6  Anthropic   89.9%, 79.6%                         1.49T       Details
11    Claude 3.7 Sonnet  Anthropic   84.8%, 70.3%, 54.8%                  N/A         Details
12    DeepSeek-R1        DeepSeek    82.4%, 73.1%, 93.1%                  N/A         Details
13    GPT-5.5            OpenAI      93.6%, 81.2%                         N/A         Website
14    Claude Fable 5     Anthropic   95.0%                                N/A         Details

Scores appear in the order listed. Rows with fewer than three scores do not indicate which benchmark is missing.
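
The ranking order in the table appears to follow weekly volume, with N/A rows last. As a minimal sketch of that ordering, the snippet below parses volume strings such as "4.83T" and sorts a subset of the rows. The helper names and the K/M/B suffixes are assumptions for illustration; only the T suffix appears on this page.

```python
# Subset of rows copied from the rankings table: (model name, weekly volume).
ROWS = [
    ("Claude Opus 4.7", "2.34T"),
    ("DeepSeek V4 Flash", "4.83T"),
    ("MiMo-V2.5", "4.49T"),
    ("DeepSeek-R1", "N/A"),
    ("MiniMax M3", "3.83T"),
]

# Only "T" appears on the page; the other suffixes are assumed for completeness.
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_volume(text):
    """Convert a string such as '4.83T' to a token count, or None for 'N/A'."""
    text = text.strip()
    if text.upper() == "N/A":
        return None
    suffix = text[-1].upper()
    if suffix in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[suffix]
    return float(text)


def rank_by_volume(rows):
    """Sort by weekly volume, largest first; rows without a volume go last."""
    parsed = [(name, parse_volume(vol)) for name, vol in rows]
    return sorted(parsed, key=lambda r: (r[1] is None, -(r[1] or 0.0)))


for rank, (name, vol) in enumerate(rank_by_volume(ROWS), start=1):
    print(rank, name, "N/A" if vol is None else f"{vol:.3g}")
```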

Theoretical Trends in Foundation Models

01

The Rise of Test-Time Compute: The industry is shifting from scaling pre-training alone toward also scaling test-time compute. Models like DeepSeek-R1 spend more compute during inference to generate, evaluate, and refine intermediate reasoning, which improves their results on math and logic problems without increasing parameter count.

02

Benchmark Saturation and Academic Focus: As models achieve near-perfect scores on simple coding and chat benchmarks, evaluation suites are moving toward high-difficulty, multi-step academic tests (like GPQA) to measure progress in scientific and engineering reasoning.

03

Open-Weights Democratization: High-quality open-weights models are challenging the dominance of proprietary APIs. By hosting models on optimized hardware or using managed router layers, enterprises can often reach comparable reasoning capabilities at a fraction of the cost.
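
A router layer of the kind mentioned above can be sketched as a simple selection rule: among the models that clear a minimum benchmark score, pick the cheapest. The model names, scores, and prices below are hypothetical placeholders, not figures from this leaderboard, and real routers weigh more factors, such as latency and availability.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float             # benchmark accuracy in percent (hypothetical)
    usd_per_m_tokens: float  # blended price per million tokens (hypothetical)


# Hypothetical candidates for illustration only.
CANDIDATES = [
    Candidate("open-weights-a", 82.0, 0.5),
    Candidate("proprietary-b", 90.0, 5.0),
    Candidate("open-weights-c", 88.0, 1.2),
]


def pick_cheapest(candidates, min_score) -> Optional[Candidate]:
    """Return the cheapest candidate scoring at least min_score, or None."""
    eligible = [c for c in candidates if c.score >= min_score]
    return min(eligible, key=lambda c: c.usd_per_m_tokens, default=None)


choice = pick_cheapest(CANDIDATES, min_score=85.0)
print(choice.name if choice else "no model meets the threshold")
```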

Academic Reasoning & Hard-Benchmark Methodologies

Evaluating foundation models requires metrics that resist memorization of web-crawled text. Standard benchmarks like GSM8K and HumanEval have suffered from data contamination, leading to inflated scores. To measure actual reasoning capabilities, we track three hard benchmarks: GPQA, SWE-bench Verified, and AIME 2025.

GPQA (Graduate-Level Google-Proof Q&A) consists of multiple-choice biology, physics, and chemistry questions written by domain experts and designed so that the answers cannot simply be looked up with a search engine. This forces the model to synthesize complex concepts rather than retrieve verbatim text.

SWE-bench Verified evaluates a model's ability to resolve real GitHub issues autonomously by writing a patch, which is then run against human-validated unit tests.

AIME 2025 (American Invitational Mathematics Examination) presents competition mathematics problems that require careful formulation, multi-step planning, and exact calculation, and that remain demanding for advanced reasoning models.
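
One common way to probe for the data contamination described above is to look for long word n-grams that a benchmark item shares with a training document. The sketch below is a simplified illustration with hypothetical sample strings. It is not the procedure used by this leaderboard or by any specific benchmark, and the whitespace tokenization and the choice of n are assumptions.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, corpus_doc, n=8):
    """Fraction of the item's n-grams that also occur in the corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


# Hypothetical strings for illustration only.
QUESTION = "which reagent selectively reduces the ketone in the presence of the ester group"
LEAKED_DOC = "forum post: " + QUESTION + " answer below"
CLEAN_DOC = "a general overview of carbonyl chemistry for undergraduate students"

print(overlap_ratio(QUESTION, LEAKED_DOC))  # high ratio flags likely contamination
print(overlap_ratio(QUESTION, CLEAN_DOC))   # low ratio suggests the item is unseen
```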

  • GPQA PhD Validation: Questions are validated by other experts in the same field and tested against skilled non-experts with web access, to ensure they are objectively answerable, highly difficult, and resistant to internet search strategies.
  • SWE-bench Verified Standards: A curated subset of SWE-bench where humans have manually validated the unit tests and problem statements to ensure they are fair and solvable.
  • AIME Olympiad Logic: High-level algebra, geometry, number theory, and combinatorics problems whose answers are integers from 0 to 999, so reasoning models must work through to an exact numerical result rather than write a proof.
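
The SWE-bench pass criterion can be sketched as follows. In the SWE-bench dataset, each instance lists FAIL_TO_PASS tests (which the fix must make pass) and PASS_TO_PASS tests (which must keep passing), and an instance counts as resolved only when both groups pass after the model's patch is applied. The `results` mapping and the helper names below are hypothetical stand-ins for the output of a real test run.

```python
def is_resolved(fail_to_pass, pass_to_pass, results):
    """True if every required test passed after the patch.

    results maps a test id to True (passed) or False (failed); a test that
    is missing from results is treated as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(test_id, False) for test_id in required)


def resolve_rate(instances):
    """Share of instances resolved, as a fraction between 0 and 1."""
    if not instances:
        return 0.0
    resolved = sum(
        is_resolved(i["FAIL_TO_PASS"], i["PASS_TO_PASS"], i["results"])
        for i in instances
    )
    return resolved / len(instances)


# Hypothetical instances: the second patch breaks a previously passing test.
INSTANCES = [
    {"FAIL_TO_PASS": ["t_fix"], "PASS_TO_PASS": ["t_old"],
     "results": {"t_fix": True, "t_old": True}},
    {"FAIL_TO_PASS": ["t_fix"], "PASS_TO_PASS": ["t_old"],
     "results": {"t_fix": True, "t_old": False}},
]

print(f"{resolve_rate(INSTANCES):.1%}")
```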

Weekly Token Telemetry & Platform Scaling Laws

Throughput and token telemetry provide insight into the practical economic adoption of foundation models. Under the Kaplan and Chinchilla scaling laws, loss falls predictably as parameters, training data, and compute grow, but larger models need far more compute to train and serve, which raises latency and cost. Telemetry from OpenRouter tracks the total prompt and completion tokens routed to each model. This acts as a real-world proxy for developer preference: a high token count suggests that developers find a model's balance of cost, speed, and capability well suited to commercial apps. Reasoning models like DeepSeek-R1 generate extensive chain-of-thought tokens before returning the final answer, which increases the completion token count but typically improves accuracy on hard problems. This telemetry helps analyze the trade-offs developers make between raw speed and reasoning quality.

  • Token Telemetry Analysis: Tracking weekly volume helps developers identify shifts in model adoption, such as the sudden growth of cost-efficient flash models or high-end reasoning systems.
  • Scaling Law Tradeoffs: Balancing parameter size against operational latency to select the most cost-effective model for production API routing.
  • Chain-of-Thought Overhead: Managing the additional completion tokens generated by reasoning models, which increases the latency of the initial response but improves reasoning depth.
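
The chain-of-thought overhead can be made concrete with a little arithmetic. The sketch below assumes that reasoning tokens are billed as completion tokens and are generated before the first answer token at a constant decode speed. The prices, token counts, and speed are hypothetical and are not taken from any provider's price list.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    prompt_tokens: int
    reasoning_tokens: int  # chain-of-thought tokens, assumed billed as completion
    answer_tokens: int


def request_cost(usage, usd_per_m_input, usd_per_m_output):
    """Cost in USD, given prices per million input and output tokens."""
    completion = usage.reasoning_tokens + usage.answer_tokens
    total = usage.prompt_tokens * usd_per_m_input + completion * usd_per_m_output
    return total / 1_000_000


def seconds_before_answer(usage, tokens_per_second):
    """Decode time spent on reasoning before the first answer token appears."""
    return usage.reasoning_tokens / tokens_per_second


# Hypothetical request: the same prompt with and without a reasoning phase.
direct = Usage(prompt_tokens=1000, reasoning_tokens=0, answer_tokens=500)
reasoned = Usage(prompt_tokens=1000, reasoning_tokens=4000, answer_tokens=500)

for label, usage in (("direct", direct), ("reasoned", reasoned)):
    cost = request_cost(usage, usd_per_m_input=1.0, usd_per_m_output=2.0)
    wait = seconds_before_answer(usage, tokens_per_second=100)
    print(label, f"cost=${cost:.4f}", f"wait={wait:.0f}s")
```

Under these assumed numbers the reasoning phase dominates both cost and waiting time, which is the trade-off the telemetry is meant to expose.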

Frequently Asked Questions

What is the difference between pre-training compute and test-time compute?

Pre-training compute is spent during the initial training of the model to learn language representations and facts from web-scale data. Test-time compute is spent during inference, allowing the model to generate multiple reasoning paths, perform internal verification, and refine its response before presenting it to the user.
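
One simple form of test-time compute is to sample several reasoning paths and keep the most common final answer, often called self-consistency or majority voting. The sketch below illustrates that idea only. It is not how any particular model works internally, and `noisy_sampler` is a hypothetical stand-in for a real model call.

```python
import random
from collections import Counter


def self_consistency(sample_fn, n_samples):
    """Draw n_samples answers and return (most common answer, vote share)."""
    answers = [sample_fn() for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


def noisy_sampler(seed=0):
    """Hypothetical model stand-in: usually answers '42', sometimes errs."""
    rng = random.Random(seed)
    return lambda: rng.choices(["42", "41", "40"], weights=[0.6, 0.25, 0.15])[0]


# More samples cost more inference compute but make the vote more stable.
for n in (1, 5, 25):
    answer, share = self_consistency(noisy_sampler(seed=n), n)
    print(n, answer, f"{share:.0%}")
```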

Why is GPQA considered resistant to data contamination?

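GPQA questions are written from scratch by subject-matter experts, so the answers did not already exist on the web when the benchmark was created. The dataset is distributed in a password-protected archive, and its authors ask that examples not be posted in plain text, which makes it harder for web crawlers to scrape the questions into pre-training data. This reduces the risk of memorization but does not eliminate it.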
