LiveBench
✓ Editorially verifiedContamination-free LLM benchmark that refreshes its questions monthly to keep frontier models honest.
Pick LiveBench if you want a credible, contamination-resistant signal when comparing frontier LLMs or validating a fine-tune against a moving target.
Skip it if you need a turnkey SaaS eval platform with hosted runs, custom datasets, and SLA support — this is a research benchmark, not a product.
LiveBench is an open evaluation framework and public leaderboard for large language models, built around the premise that any static benchmark is one training run away from being memorised. It scores models across six domains, reasoning, math, coding, language, data analysis, and instruction following, using objectively verifiable ground-truth tasks rather than an LLM-as-judge. New questions, sourced from recent arXiv papers, news, IMDb synopses, and freshly-released datasets, are added on a monthly cadence so the test set is always partly novel.
The project was introduced by Colin White, Samuel Dooley, Manley Roberts and collaborators (Abacus.AI, NYU, Nvidia, Meta and others) and received a Spotlight at ICLR 2025. It is the benchmark that vendor launch posts increasingly cite alongside MMLU and GPQA, and the leaderboard at livebench.ai is the headline artefact: filter by category, compare frontier closed models (GPT, Claude, Gemini, Grok) against open-weights contenders (Llama, Qwen, DeepSeek, Mistral), and inspect per-task scores. The runner, prompts and scoring code all live in the livebench/livebench GitHub repo, so you can replicate scores or evaluate your own fine-tunes.
It is a research project, not a SaaS: there is no hosted evaluation API, no billing, and no enterprise tier. You either read the leaderboard or you clone the repo and run it yourself against an OpenAI-compatible endpoint.
LiveBench has quietly become one of the few public benchmarks worth quoting in 2026, precisely because its monthly refresh forces labs to ship genuine capability gains rather than overfit to a fixed test. Treat it as a sanity check alongside your own task-specific evals, not a substitute for them.
— The AI Tool Bible editorial team
Pros
- ✅ Monthly question refresh meaningfully blunts training-set contamination
- ✅ Objective auto-scoring with ground truth, no LLM-judge bias
- ✅ Covers six diverse domains including reasoning, code and math
- ✅ Fully open source; reproduce scores or evaluate your own model
- ✅ Cited by frontier labs, so scores travel in industry discussions
Cons
- ⚠️ No hosted API; you must run the eval harness yourself
- ⚠️ Leaderboard UI is functional but spartan compared to commercial dashboards
- ⚠️ Monthly cadence still leaves a window where recent questions can leak
Use cases
Explore related
Compare with similar tools
All in Evaluation →Braintrust
FeaturedEval, monitor, and improve AI products end-to-end.
LangSmith
LangChain's eval + observability platform.
Weights & Biases
The ML experiment tracker, now with LLM eval features.
Helicone
Open-source LLM observability — one-line proxy install.
Humanloop
Prompt management + evals for collaborative AI teams.
PromptLayer
Lightweight prompt logging + management for OpenAI/Claude apps.