Best AI tools for evals datasets
10 tools in the Evaluation category, filtered to evals datasets.
Braintrust
Eval, monitor, and improve AI products end-to-end.
LangSmith
LangChain's eval + observability platform.
Humanloop
Prompt management + evals for collaborative AI teams.
Patronus
Automated LLM evaluation for hallucinations, safety, and quality.
CompassRank
Public leaderboard from the OpenCompass project ranking open and closed LLMs across 100+ benchmarks.
Inspect AI
Open-source LLM evaluation framework from the UK AI Security Institute with 200+ built-in benchmarks.
Maxim AI
End-to-end evaluation, simulation, and observability platform for shipping production-grade AI agents.
OpenAI Evals
OpenAI's open-source framework for benchmarking LLMs against a shared registry of evaluations.
Promptfoo
Open-source eval and red-teaming framework for LLM apps, prompts, and RAG pipelines.
Respan (formerly Keywords AI)
LLM engineering platform combining a multi-model gateway with tracing, evals, and prompt management.