📖 The AI Tool Bible

CompassRank

Public leaderboard from the OpenCompass project ranking open and closed LLMs across 100+ benchmarks.

Free · Free leaderboard; OpenCompass toolkit is Apache 2.0 open source · Evaluation · Multi-model
Best for

Pick CompassRank if you need a third-party, reproducible LLM leaderboard with strong coverage of both Western and Chinese open-source models.

Skip if

Skip it if you want an interactive arena-style human-preference ranking or a polished English-only product experience.

CompassRank is the public-facing leaderboard arm of OpenCompass (司南), the LLM evaluation framework developed by Shanghai AI Laboratory. It aggregates results from running both open-source models (LLaMA, Qwen, InternLM, ChatGLM, Gemma, etc.) and proprietary API models (OpenAI, Claude, Gemini, Baidu, Huawei) through a standardized harness of more than 100 datasets covering roughly 400,000 questions. Models are scored across five capability dimensions using a mix of rule-based and LLM-judge evaluation.
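
As a rough illustration of how per-dataset results might roll up into capability dimensions, here is a minimal Python sketch. The dataset names, the dataset-to-dimension mapping, and the unweighted averaging are all assumptions made for illustration; this page does not document how CompassRank actually weights or aggregates its scores.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical dataset -> dimension mapping; placeholder names, not real benchmarks.
DATASET_DIMENSION = {
    "dataset_a": "reasoning",
    "dataset_b": "reasoning",
    "dataset_c": "knowledge",
    "dataset_d": "code",
}


def dimension_scores(dataset_scores, dataset_dimension=DATASET_DIMENSION):
    """Average per-dataset scores (0-100) into one score per dimension.

    Assumes a plain unweighted mean; a real leaderboard may weight datasets.
    """
    grouped = defaultdict(list)
    for dataset, score in dataset_scores.items():
        grouped[dataset_dimension[dataset]].append(score)
    return {dim: mean(scores) for dim, scores in grouped.items()}


def overall_score(dim_scores):
    """Unweighted mean across dimensions (again, an assumption)."""
    return mean(dim_scores.values())


# Made-up scores for a placeholder model, purely to show the shape of the data.
example = {"dataset_a": 60.0, "dataset_b": 80.0, "dataset_c": 50.0, "dataset_d": 90.0}
per_dimension = dimension_scores(example)
```

Averaging within a dimension first keeps a dimension that has many datasets from dominating the overall number, which is one reason a leaderboard might report dimension scores alongside a single aggregate.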

It is aimed at researchers, model trainers, and procurement teams who need a defensible third-party comparison rather than vendor-supplied numbers. The leaderboard itself is free to browse, and the underlying OpenCompass toolkit is Apache 2.0 licensed and pip-installable, so any team can reproduce the runs locally with HuggingFace, LMDeploy, or vLLM backends. Sister projects CompassHub (benchmark browser) and CompassKit (specialized eval toolkits including vision-language) round out the ecosystem.

Caveats: the leaderboard is hosted in China, and the documentation is bilingual (Chinese and English) with uneven English coverage, which can slow Western users. Coverage of Chinese-origin models is unusually deep compared to Western leaderboards like Open LLM Leaderboard or LMSYS, which is either a feature or a bias depending on your use case.

Editor's take

This is the most serious open evaluation effort coming out of China and pairs nicely with LMSYS Arena and the HuggingFace Open LLM Leaderboard. The fact that the entire harness is Apache 2.0 and pip-installable is what elevates it above a vanity leaderboard.

— The AI Tool Bible editorial team

Pros

  • Reproducible: every score is generated by the open-source OpenCompass harness
  • Broad coverage of both Western and Chinese LLMs; the latter are often missing from other boards
  • 100+ datasets across reasoning, knowledge, language, code, and safety
  • Apache 2.0 toolkit lets you run the same evals on private models

Cons

  • ⚠️ UI and docs are Chinese-first; English coverage is uneven
  • ⚠️ Hosted in mainland China, so users abroad may see occasional latency or access issues
  • ⚠️ Benchmark contamination risks apply as with any static leaderboard

Use cases

llm-benchmarking · model-selection · leaderboards · reproducible-evals · vision-language-eval
