LiveBench

✓ Editorially verified

Contamination-free LLM benchmark that refreshes its questions monthly to keep frontier models honest.

Free· Free and open source; self-hosted evaluation runnerEvaluationMulti-model

Best for

Pick LiveBench if you want a credible, contamination-resistant signal when comparing frontier LLMs or validating a fine-tune against a moving target.

Skip if

Skip it if you need a turnkey SaaS eval platform with hosted runs, custom datasets, and SLA support — this is a research benchmark, not a product.

LiveBench is an open evaluation framework and public leaderboard for large language models, built around the premise that any static benchmark is one training run away from being memorised. It scores models across six domains, reasoning, math, coding, language, data analysis, and instruction following, using objectively verifiable ground-truth tasks rather than an LLM-as-judge. New questions, sourced from recent arXiv papers, news, IMDb synopses, and freshly-released datasets, are added on a monthly cadence so the test set is always partly novel.

The project was introduced by Colin White, Samuel Dooley, Manley Roberts and collaborators (Abacus.AI, NYU, Nvidia, Meta and others) and received a Spotlight at ICLR 2025. It is the benchmark that vendor launch posts increasingly cite alongside MMLU and GPQA, and the leaderboard at livebench.ai is the headline artefact: filter by category, compare frontier closed models (GPT, Claude, Gemini, Grok) against open-weights contenders (Llama, Qwen, DeepSeek, Mistral), and inspect per-task scores. The runner, prompts and scoring code all live in the livebench/livebench GitHub repo, so you can replicate scores or evaluate your own fine-tunes.

It is a research project, not a SaaS: there is no hosted evaluation API, no billing, and no enterprise tier. You either read the leaderboard or you clone the repo and run it yourself against an OpenAI-compatible endpoint.

Editor's take

LiveBench has quietly become one of the few public benchmarks worth quoting in 2026, precisely because its monthly refresh forces labs to ship genuine capability gains rather than overfit to a fixed test. Treat it as a sanity check alongside your own task-specific evals, not a substitute for them.

— The AI Tool Bible editorial team