LLM Stats
Live leaderboard and side-by-side comparison hub for 300+ frontier LLMs across reasoning, coding, and multimodal benchmarks.
Pick LLM Stats if you need a fast, opinion-free snapshot of where every major model lands on price, speed, and standard benchmarks before you commit to one.
Skip it if you need rigorous, task-specific evals on your own data or audit-grade methodology disclosure for procurement.
LLM Stats is a free public dashboard that ranks and compares large language models on standardized benchmarks (GPQA Diamond, SWE-Bench Verified, MMLU/MMLU-Pro, AIME 2025, MATH, HumanEval, MMMU, LiveCodeBench) alongside live throughput, time-to-first-token, context window, and per-million-token pricing. The site aggregates 300+ models from OpenAI, Anthropic, Google, Meta, Mistral, Qwen and others into category-specific leaderboards for coding, writing, math, research, and image/video generation, plus head-to-head 'arena' tools where you can blind-test outputs.
It is aimed at developers, ML engineers, and procurement leads who need to shortlist a model for production before they have run their own evals. The cost columns make it especially useful for back-of-envelope budget math, and the SWE-Bench/LiveCodeBench surfaces are some of the cleanest public views of coding-model rank. The platform itself is free; the pricing shown is whatever each upstream provider charges for its model.
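The back-of-envelope budget math mentioned above can be sketched in a few lines of Python. This is an illustrative sketch only: the model names, the prices, and the workload numbers are hypothetical placeholders, not figures from LLM Stats, and the `ModelPrice` and `monthly_cost` names are introduced here for the example. Substitute the per-million-token input and output prices shown on the leaderboard for the models you are considering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Per-million-token prices in USD (hypothetical placeholder values below)."""
    name: str
    input_usd_per_mtok: float
    output_usd_per_mtok: float


def monthly_cost(price: ModelPrice, requests_per_day: int,
                 input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """Estimate monthly spend from average tokens per request.

    Assumes every request uses the same number of input and output tokens,
    and ignores caching discounts, batch pricing, and retries.
    """
    per_request = (
        input_tokens * price.input_usd_per_mtok
        + output_tokens * price.output_usd_per_mtok
    ) / 1_000_000
    return per_request * requests_per_day * days


# Placeholder prices: replace with the values listed for your candidate models.
CANDIDATES = [
    ModelPrice("model-a", input_usd_per_mtok=3.00, output_usd_per_mtok=15.00),
    ModelPrice("model-b", input_usd_per_mtok=0.50, output_usd_per_mtok=1.50),
]

if __name__ == "__main__":
    # Assumed workload: 10,000 requests/day, 1,500 input and 400 output tokens each.
    for candidate in CANDIDATES:
        cost = monthly_cost(candidate, requests_per_day=10_000,
                            input_tokens=1_500, output_tokens=400)
        print(f"{candidate.name}: ${cost:,.2f}/month")
```

An estimate like this is only as good as the token averages fed into it, so it is best read as a rough ranking of candidates by cost rather than as a forecast of the actual bill.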
LLM Stats is operated in conjunction with ZeroEval, which provides the underlying gateway/eval infrastructure and a documented LLM-gateway API at docs.zeroeval.com. There is no open-source repo of the leaderboard itself, and benchmark coverage skews toward the well-known public sets, so treat it as a starting point rather than a substitute for your own task-specific evaluation.
A genuinely useful one-stop dashboard that has become a default bookmark for anyone tracking the frontier-model race. The breadth and cost columns are the draw; just remember that public benchmarks are leaky proxies and the real test is always your own prompts on your own traffic.
— The AI Tool Bible editorial team
Pros
- ✅ Covers 300+ models with both benchmark scores and live latency/throughput
- ✅ Side-by-side price-per-million-token columns make cost comparison trivial
- ✅ Task-specific leaderboards (coding, math, research) instead of one global rank
- ✅ Interactive arenas let you sanity-check outputs before committing to a provider
Cons
- ⚠️ Relies on public benchmarks that frontier labs increasingly train against
- ⚠️ Leaderboard itself is not open source and methodology is lightly documented
- ⚠️ No first-party cost calculator or workload simulator for real traffic patterns
Compare with similar tools
Braintrust
Eval, monitor, and improve AI products end-to-end.
LangSmith
LangChain's eval + observability platform.
Weights & Biases
The ML experiment tracker, now with LLM eval features.
Helicone
Open-source LLM observability — one-line proxy install.
Humanloop
Prompt management + evals for collaborative AI teams.
PromptLayer
Lightweight prompt logging + management for OpenAI/Claude apps.