Top Foundation Models

LLM Model Leaderboard

Compare GPQA, SWE-bench, and AIME 2025 scores alongside weekly token volumes for the leading foundation models.


Champion: DeepSeek V4 Flash (DeepSeek), 4.83T weekly tokens

Silver: MiMo-V2.5 (Xiaomi), 4.49T weekly tokens

Bronze: MiniMax M3 (MiniMax), 3.83T weekly tokens

Full Rankings Table

Rank	Model	Provider	GPQA	SWE-bench	AIME 2025	Weekly Volume (tokens)
1	DeepSeek V4 Flash	DeepSeek	88.1%	79.0%	—	4.83T
2	MiMo-V2.5	Xiaomi	66.7%	78.9%	—	4.49T
3	MiniMax M3	MiniMax	—	80.5%	—	3.83T
4	Owl Alpha	OpenRouter	—	—	—	3.29T
5	Hy3 preview	Tencent	—	—	—	3.28T
6	Claude Opus 4.7	Anthropic	94.2%	87.6%	—	2.34T
7	DeepSeek V4 Pro	DeepSeek	90.1%	80.6%	—	2.07T
8	Claude Opus 4.8	Anthropic	93.6%	88.6%	—	2.06T
9	GLM 5.2	Zhipu AI	91.2%	—	—	1.92T
10	Claude Sonnet 4.6	Anthropic	89.9%	79.6%	—	1.49T
11	Claude 3.7 Sonnet	Anthropic	84.8%	70.3%	54.8%	N/A
12	DeepSeek-R1	DeepSeek	82.4%	73.1%	93.1%	N/A
13	GPT-5.5	OpenAI	93.6%	—	81.2%	N/A
14	Claude Fable 5	Anthropic	—	95.0%	—	N/A
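The rows above appear to be ordered by weekly volume, with models that have no volume data (N/A) listed last. A minimal sketch of that ordering follows. It assumes the "T" suffix means trillions of tokens; the function names and the sample rows are illustrative and are not the site's actual ranking code.

```python
from typing import Optional

# Suffix multipliers; "T" is assumed here to mean trillions of tokens.
_SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_volume(text: str) -> Optional[float]:
    """Convert a string such as '4.83T' to a token count; return None for 'N/A'."""
    text = text.strip()
    if text == "N/A" or not text:
        return None
    suffix = text[-1].upper()
    if suffix in _SUFFIXES:
        return float(text[:-1]) * _SUFFIXES[suffix]
    return float(text)


def rank_by_volume(rows: list[tuple[str, str]]) -> list[str]:
    """Order model names by weekly volume, descending, keeping N/A rows last."""
    def sort_key(row: tuple[str, str]) -> tuple[bool, float]:
        volume = parse_volume(row[1])
        # False sorts before True, so rows with a known volume come first.
        return (volume is None, -(volume or 0.0))

    return [name for name, _ in sorted(rows, key=sort_key)]


# A few rows copied from the table, deliberately shuffled.
rows = [
    ("MiniMax M3", "3.83T"),
    ("GPT-5.5", "N/A"),
    ("DeepSeek V4 Flash", "4.83T"),
    ("MiMo-V2.5", "4.49T"),
]
ranking = rank_by_volume(rows)
```

Because Python's sort is stable, rows without volume data keep their original relative order at the end of the list.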

Theoretical Trends in Foundation Models

The Rise of Test-Time Compute: The industry is shifting from pure pre-training scaling to test-time compute scaling. Models like DeepSeek-R1 spend more compute during inference to generate, evaluate, and refine intermediate reasoning, which improves their results on math and logic problems without increasing parameter count.

Benchmark Saturation and Academic Focus: As models achieve near-perfect scores on simple coding and chat benchmarks, evaluation suites are moving toward high-difficulty, multi-step academic tests (like GPQA) to measure progress in scientific and engineering logic.

Open-Weights Democratization: High-quality open-weights models are challenging the dominance of proprietary APIs. By hosting models on optimized hardware or using managed router layers, enterprises can obtain comparable reasoning capabilities at a fraction of the cost.

Academic Reasoning & Hard-Benchmark Methodologies

Evaluating foundation models requires metrics that resist memorization of web-crawled text. Standard benchmarks like GSM8K and HumanEval have suffered from data contamination, leading to inflated scores. To measure actual reasoning capabilities, we track three harder benchmarks: GPQA, SWE-bench Verified, and AIME 2025.

GPQA (Graduate-Level Google-Proof Q&A) consists of multiple-choice biology, physics, and chemistry questions written by domain experts who hold or are pursuing PhDs. The questions are designed to be hard to answer even with unrestricted web search, which forces the model to synthesize complex concepts rather than retrieve verbatim text.

SWE-bench Verified evaluates a model's ability to resolve real GitHub issues autonomously: the model writes a patch, and the patch is run against human-validated unit tests from the repository.

AIME 2025 (American Invitational Mathematics Examination) presents competition mathematics problems that require careful formulation, multi-step planning, and exact calculation. It remains challenging even for advanced reasoning models.
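The pass/fail rule behind a SWE-bench style score can be sketched as follows. In SWE-bench, an issue counts as resolved when the tests that the reference fix is meant to repair now pass and the tests that already passed still pass. The class and function names below are illustrative assumptions, not the benchmark harness's own API.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Test outcomes for one benchmark issue after the model's patch is applied."""
    instance_id: str
    fail_to_pass: dict[str, bool]  # tests that failed before the fix; True = passes now
    pass_to_pass: dict[str, bool]  # tests that passed before the fix; True = still passes


def is_resolved(result: InstanceResult) -> bool:
    """Resolved only if every target test passes and nothing else regressed."""
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolved_rate(results: list[InstanceResult]) -> float:
    """Percentage of instances resolved, comparable to the SWE-bench column."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


# Hypothetical results: the second patch fixes the bug but breaks an existing test.
results = [
    InstanceResult("issue-a", {"test_fix": True}, {"test_old": True}),
    InstanceResult("issue-b", {"test_fix": True}, {"test_old": False}),
]
rate = resolved_rate(results)
```

Requiring the previously passing tests to stay green is what keeps a patch from scoring by deleting or bypassing existing behaviour.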

GPQA PhD Validation: Each question is checked by additional experts in the same field and attempted by skilled non-experts with web access, to ensure it is objectively answerable, highly difficult, and resistant to internet search strategies.
SWE-bench Verified Standards: A curated subset of SWE-bench in which human annotators have manually validated the unit tests and problem statements to ensure they are fair and solvable.
AIME Olympiad Logic: High-level algebra, geometry, number theory, and combinatorics problems. Each answer is an integer from 0 to 999, so models are scored on the final numerical result rather than on a written proof.
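Since every AIME answer is an integer between 0 and 999, scoring reduces to exact match on the final answer. The sketch below assumes the model either wraps its answer in `\boxed{...}` or ends its response with the answer; that extraction convention is an assumption made for illustration, not a rule of the benchmark.

```python
import re
from typing import Optional


def extract_answer(response: str) -> Optional[int]:
    """Return the final integer answer if it lies in the valid AIME range 0..999."""
    # Prefer an explicit \boxed{...} answer; otherwise fall back to the last integer.
    boxed = re.findall(r"\\boxed\{\s*(\d+)\s*\}", response)
    candidates = boxed or re.findall(r"\d+", response)
    if not candidates:
        return None
    value = int(candidates[-1])
    return value if 0 <= value <= 999 else None


def aime_accuracy(responses: list[str], answers: list[int]) -> float:
    """Percentage of responses whose extracted answer matches exactly."""
    if len(responses) != len(answers):
        raise ValueError("responses and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(extract_answer(r) == a for r, a in zip(responses, answers))
    return 100.0 * correct / len(answers)


# Hypothetical model outputs and reference answers.
responses = ["After simplifying, we get \\boxed{073}.", "The answer is 205."]
answers = [73, 204]
accuracy = aime_accuracy(responses, answers)
```

Exact-match scoring gives no partial credit, which is one reason a single arithmetic slip late in a long derivation costs the whole problem.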

Weekly Token Telemetry & Platform Scaling Laws

Throughput and token telemetry provide insight into the practical economic adoption of foundation models. The Kaplan and Chinchilla scaling laws relate a model's loss to its parameter count, training data, and compute: larger models reach lower perplexity but require far larger compute budgets, which raises serving latency and cost. Telemetry from OpenRouter tracks the total prompt and completion tokens routed to each model. This acts as a real-world proxy for developer preference: a high token count suggests that developers find the model's balance of cost, speed, and capability well suited to commercial apps. Reasoning models like DeepSeek-R1 generate extensive internal chain-of-thought tokens before returning the final answer, which increases the completion token count but typically improves accuracy on hard problems. This telemetry helps analyze the trade-offs developers make between raw speed and reasoning quality.
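As a rough illustration of how prompt and completion tokens roll up into a weekly figure per model, consider the sketch below. The `UsageRecord` schema is hypothetical and the sample numbers are made up; this is not OpenRouter's data format or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """One routed request (hypothetical schema, for illustration only)."""
    model: str
    prompt_tokens: int
    completion_tokens: int


def weekly_totals(records: list[UsageRecord]) -> dict[str, int]:
    """Sum prompt and completion tokens per model over one week of records."""
    totals: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record.model] += record.prompt_tokens + record.completion_tokens
    return dict(totals)


def volume_share(totals: dict[str, int]) -> dict[str, float]:
    """Each model's fraction of all routed tokens."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {model: count / grand_total for model, count in totals.items()}


# Made-up sample records.
records = [
    UsageRecord("model-a", prompt_tokens=1200, completion_tokens=300),
    UsageRecord("model-b", prompt_tokens=400, completion_tokens=100),
    UsageRecord("model-a", prompt_tokens=800, completion_tokens=200),
]
totals = weekly_totals(records)
shares = volume_share(totals)
```

Note that a total like this mixes prompt-heavy and completion-heavy workloads, so two models with equal volume may serve very different usage patterns.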

Token Telemetry Analysis: Tracking weekly volume helps developers identify shifts in model adoption, such as the sudden growth of cost-efficient flash models or high-end reasoning systems.
Scaling Law Tradeoffs: Balancing parameter size against operational latency to select the most cost-effective model for production API routing.
Chain-of-Thought Overhead: Managing the additional completion tokens generated by reasoning models, which increases the latency of the initial response but improves reasoning depth.
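The chain-of-thought overhead can be estimated with simple arithmetic. The sketch below assumes a constant decoding speed and assumes reasoning tokens are billed at the same rate as other completion tokens; all parameter names, prices, and speeds are placeholders to be replaced with a provider's real figures.

```python
from dataclasses import dataclass


@dataclass
class ResponseEstimate:
    seconds_until_answer_starts: float
    total_seconds: float
    cost_usd: float


def estimate_response(
    prompt_tokens: int,
    reasoning_tokens: int,
    answer_tokens: int,
    tokens_per_second: float,
    first_token_seconds: float,
    input_price_per_million: float,
    output_price_per_million: float,
) -> ResponseEstimate:
    """Estimate latency and cost when hidden reasoning precedes the visible answer."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    completion_tokens = reasoning_tokens + answer_tokens
    # The visible answer begins only after all reasoning tokens have been decoded.
    answer_start = first_token_seconds + reasoning_tokens / tokens_per_second
    total = first_token_seconds + completion_tokens / tokens_per_second
    cost = (
        prompt_tokens * input_price_per_million
        + completion_tokens * output_price_per_million
    ) / 1_000_000
    return ResponseEstimate(answer_start, total, cost)


# Placeholder numbers: 4,000 reasoning tokens before a 500-token answer.
estimate = estimate_response(
    prompt_tokens=1_000,
    reasoning_tokens=4_000,
    answer_tokens=500,
    tokens_per_second=100.0,
    first_token_seconds=0.5,
    input_price_per_million=1.0,
    output_price_per_million=4.0,
)
```

With these placeholder values the reasoning phase accounts for most of both the wait and the bill, which is the trade-off the bullet above describes.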

Frequently Asked Questions

What is the difference between pre-training compute and test-time compute?

Pre-training compute is spent during the initial training of the model to learn language representations and facts from web-scale data. Test-time compute is spent during inference, allowing the model to generate multiple reasoning paths, perform internal verification, and refine its response before presenting it to the user.
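One simple way to spend extra compute at inference is self-consistency: sample several independent answers and keep the most common one. The sketch below shows that majority vote with a stub sampler standing in for a real model call. It illustrates the general idea only and is not a description of how any particular model in the table reasons internally.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def self_consistency(
    sample: Callable[[str], Optional[str]],
    prompt: str,
    n_samples: int,
) -> Optional[str]:
    """Draw n_samples answers and return the most frequent one (majority vote)."""
    answers = [sample(prompt) for _ in range(n_samples)]
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    # most_common(1) returns [(answer, count)] for the top answer.
    return Counter(valid).most_common(1)[0][0]


def make_stub_sampler(outputs: Iterable[str]) -> Callable[[str], Optional[str]]:
    """Stand-in for a model call: returns pre-set answers one at a time."""
    iterator = iter(outputs)

    def sample(prompt: str) -> Optional[str]:
        return next(iterator, None)

    return sample


# Five simulated reasoning paths; three of them agree on "42".
sampler = make_stub_sampler(["42", "41", "42", "42", "17"])
final_answer = self_consistency(sampler, "What is 6 * 7?", n_samples=5)
```

Raising `n_samples` costs proportionally more inference compute, which is the sense in which accuracy is traded against test-time compute rather than parameter count.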

Why is GPQA considered resistant to data contamination?

GPQA questions are written from scratch by subject-matter experts, and the dataset is distributed in access-restricted form (for example, as a password-protected archive) with a request that the questions not be reposted in plain text. Because the questions are kept out of openly crawlable text, web crawlers are unlikely to scrape them, which reduces the risk that models memorize the answers during pre-training.
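A common heuristic for spotting contamination is to measure how many word n-grams of a benchmark question also appear in a training document. The sketch below is a simplified version of that idea using whitespace tokenization; the function names, the default n-gram length, and the example strings are assumptions for illustration, not the procedure used by any specific benchmark or lab.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(question: str, document: str, n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in the document."""
    question_ngrams = ngrams(question, n)
    if not question_ngrams:
        return 0.0
    shared = question_ngrams & ngrams(document, n)
    return len(shared) / len(question_ngrams)


# Toy strings with a short n so the overlap is easy to follow.
question = "the quick brown fox jumps"
document = "a quick brown fox jumps high"
ratio = overlap_ratio(question, document, n=3)
```

A ratio near 1.0 suggests the question text was present almost verbatim in the document, while a ratio near 0.0 suggests no direct leakage, though paraphrased leaks would not be caught by this check.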
